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Abstract We present a simple and fast geometric method 
for modeling data by a union of affine subspaces. The method 
begins by forming a collection of local best-fit affine sub- 
spaces, i.e., subspaces approximating the data in local neigh- 
borhoods. The correct sizes of the local neighborhoods are 
determined automatically by the Jones' (5% numbers (we prove 
under certain geometric conditions that our method finds the 
optimal local neighborhoods). The collection of subspaces is 
further processed by a greedy selection procedure or a spec- 
tral method to generate the final model. We discuss applica- 
tions to tracking-based motion segmentation and clustering 
of faces under different illuminating conditions. We give ex- 
tensive experimental evidence demonstrating the state of the 
art accuracy and speed of the suggested algorithms on these 
problems and also on synthetic hybrid linear data as well 
as the MNIST handwritten digits data; and we demonstrate 



how to use our algorithms for fast determination of the num- 
ber of affine subspaces. 
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1 Introduction 

Several problems from computer vision, such as motion seg- 
mentation and face clustering, give rise to modeling data 
by multiple subspaces. This is referred to as Hybrid Lin- 
ear Modeling (HLM) or alternatively as "subspace cluster- 
ing". In tracking-based motion segmentation, extracted fea- 
ture points (tracked in all frames) are clustered according to 
the different moving objects. Under the affine camera model, 
the vectors of coordinates of feature points corresponding to 
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a moving rigid object lie on an affine subspace of dimension 
at most 3 (see [8 1). Thus clustering different moving objects 
is equivalent to clustering different affine subspaces. Sim- 
ilarly, in face clustering, it has been proved that the set of 
all images of a Lambertian object under a variety of light- 
ing conditions form a convex polyhedral cone in the im- 
age space, and this cone can be accurately approximated 
by a low-dimensional linear subspace (of dimension at most 
9) ni2U16l l2l. One may thus cluster certain images of faces 
by HLM algorithms. 

The mathematical formulation of HLM assumes a data 
set X = C M. where each jq lies on (or around) 

one of K flats (i.e., affine subspaces) and requires to find 
the partition of X corresponding to the flats. We would like 
to be able to do this when the data has been corrupted by 
additive noise and outlier^; in this case we may also want 

1 Throughout the paper outliers are corrupted data points, i.e., points 
generated by a distribution, which assigns sufficiently small probabil- 
ity for small neighborhoods around the underlying subspaces. This is 
different than corrupting selected entries of data points. 
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to determine the flats themselves. We first assume here that 
all flats have the same known dimension d (i.e., they are d- 
flats) and that their number K is known. In Section [3] we 
address to some extent the cases of unknown K and mixed 
dimensions. 

Several algorithms have been suggested for solving the 
HLM problem (or even the more general problem of clus- 
tering manifolds), for example the if -flats (KF) algorithm 
or any of its variants 1 34,4 36 , 16 44 1, methods based on di- 
rect matrix factorization [3.8.. 18.. 19 1, Generalized Principal 
Component Analysis (GPCA) [38|, Local Subspace Affinity 
(LSA) gO), RANSAC (for HLM) E), Locally Linear Man- 
ifold Clustering (LLMC) 1151 . Agglomerative Lossy Com- 
pression (ALC) [27 1, Spectral Curvature Clustering (SCC) 
and Sparse Subspace Clustering (SSC) Hill . Some the- 
oretical guarantees for particular HLM algorithms appear 
in ll5l[Tl l24ll3Tl . We recommend a recent review on HLM by 
Vidal 11371 . 

Many of the algorithms described above require an ini- 
tial guess of the subspaces. For example, the if-flats algo- 
rithm is an iterative method that requires an initialization, 
and in SCC, one needs to carefully choose collections of 
d + 1 data points that lie close to each of the underlying d- 
flats. Other algorithms require some information about the 
suspected deviations from the hybrid linear model; for ex- 
ample both RANSAC (for HLM) and ALC ask for a model 
parameter corresponding to the level of noise. 

Here we propose a straightforward geometric method for 
the estimation of local subspaces, which is inspired by ifTTl 
[I0ll22l and IU3U26ll25ll . These local subspace estimates can 
be used to set the model parameters for or initialize an HLM 
algorithm. The basic idea is that for a data set X sampled 
from a hybrid linear model (perhaps with some noise), there 
are many points x such that the principal components of 
an appropriately sized neighborhood of x give a good ap- 
proximation to the subspace x belongs to. Using local sub- 
spaces to infer the global hybrid linear model was suggested 
in (40] for linear subspaces; however, there they use very 
small neighborhoods that are not adaptive to the structure of 
the data (e.g., amount of noise etc.). An "appropriately sized 
neighborhood" needs to be larger than the noise, so that the 
subspace is recognized. However, the neighborhood cannot 
be so large that it contains points from multiple subspaces. 
The correct choice of this size is carefully quantified in Sec- 
tion o 

In addition to studying how to estimate local subspaces, 
we describe two complete HLM algorithms which are nat- 
ural extensions of the local estimation: LBF (Local Best-fit 
Flat) and SLBF (spectral LBF). On many data sets, the first 
obtains state of the art speed with nearly state of the art ac- 
curacy (it can also deal with very large data), and the sec- 
ond obtains state of the art accuracy (SLBF) with reason- 
able run times (it seems to be able to deal to some extent 



with some nonlinear structures as the ones arising in motion 
segmentation data). We remark that we test accuracy in var- 
ious scenarios, but in particular, with intersecting subspaces 
and with outliers. While in this work we only theoretically 
justify our choice of initializer, we are hopeful about devel- 
oping a more complete theory justifying our algorithms. 

In particular, we believe that such a theory can be valid 
in the setting suggested by Soltanolkotabi and Candes BP 
for analyzing the SSC algorithm, while having additional 
noise and restricting the fraction of outliers (or modifying 
our algorithms so they are even more robust to outliers). We 
are also interested in rigorously quantifying the limitations 
of our algorithms (as conjectured in SectionH]). 

We summarize the main contributions of this work, which 
is the full length version of B31 . as follows. 

• We make precise the local best-fit heuristic, using the fa 
numbers 117lll0ll22l . We give an algorithm to approx- 
imately find optimal neighborhoods in the above sense, 
in fact, we prove this under certain geometric conditions. 

• Using the local best-fit heuristic, we introduce the LBF 
and SLBF algorithms for HLM. At each point of a ran- 
domly chosen subset of the data, they use the best-fit 
flats of the "optimal" neighborhoods to build a global 
model with different methods (LBF is based on energy 
minimization and SLBF is a spectral method). 

• We perform extensive experiments on motion segmen- 
tation data (the Hopkins 155 benchmark of [35]), face 
clustering (the extended Yale face database B), hand- 
written digits (the MNIST database), and artificial data, 
showing that both algorithms, in particular SLBF, are ac- 
curate on real and synthetic HLM problems, while LBF 
runs extremely fast (often on the order of ten times faster 
than most of the previously mentioned methods). For the 
cropped face data we actually indicate a fundamental 
problem of local methods like LBF and SLBF, though 
suggest a workaround that works for this particular data. 

• We demonstrate how the local best-fit heuristic can be 
used with other algorithms. In particular, we give exper- 
imental evidence to show that the if-flats algorithm |[T6l 
is improved by initialization that is based on the local 
best-fit heuristic. We also use this heuristic to estimate 
the main parameters of both RANSAC (for HLM) fl42l 
and ALC (27). 

• We show how the combination of LBF and the elbow 
method can quickly determine the number of subspaces. 

The rest of this paper is organized as follows. In Sec- 
tion [2] we describe the LBF and SLBF algorithms and state 
a theorem giving conditions that guarantee that good neigh- 
borhoods can be found. Section[3]carefully tests the LBF and 
SLBF algorithms (while comparing them to other common 
HLM algorithms) on both artificial data of synthetic hybrid 
linear models and real data of motion segmentation in video 
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sequences, face clustering and handwritten digits recogni- 
tion. It also demonstrates how to determine the number of 
clusters by applying the fast algorithm of this paper together 
with the straightforward elbow method. Section|4]concludes 
with a brief discussion and mentions possibilities for future 
work. 

2 The local best-fit flats heuristic and the LBF and 
SLBF algorithms 

We describe two methods, LBF and SLBF, which have at 
their heart an estimation of local flats capturing the global 
structures of the data (or part of it). Both methods first find 
a set of candidate flats (the number is an input parameter for 
LBF). These are best-fit flats for local "optimal" neighbor- 
hoods (we describe an algorithm for approximately finding 
such neighborhoods and justify it in Section IZTT i. The two 
algorithms process the candidates in different ways: LBF 
uses energy minimization and SLBF uses a spectral approach. 

2.1 Choosing the optimal neighborhood 

We choose the candidate flats that capture the global struc- 
ture of the data by fitting them to 'optimal' local neighbor- 
hoods of data points. For a point x 6 M. D , we define an 
optimal neighborhood as the largest ball -B(x, r) (centered 
at x and with radius r) that only contains points sampled 
from the same cluster as x. Indeed, neighborhoods smaller 
than the optimal one (around x) can mainly contain the noise 
around an underlying subspace (of the hybrid linear model); 
consequently their local best-fit flats may not match the un- 
derlying flat. On the other hand larger neighborhood than 
the optimal one (around x) will contain points from more 
than one underlying flat, and the resulting best-fit flat will 
again not match any of the underlying flats. We note that the 
choice of neighborhood B(x, r) is equivalent to the choice 
of radius r, which we refer to as scale (even though it is 
also common to refer to log(r) or -log(r) as scale). While 
it is possible to take a guess at the optimal scale as a pa- 
rameter (e.g., as commonly done by fixing the number of 
nearest neighbors to x), we have found that it is possible 
to choose the optimal scale reasonably well automatically, 
while adapting it to the given point x. 

We will start at the smallest scale (i.e., smallest radius 
containing only d + 1 points) and look at larger and larger 
neighborhoods of a given point Xrj. At the smallest scale, 
any noise may cause the local neighborhood to have higher 
dimension than d. As we add points to the neighborhood, it 
becomes better and better approximated in a scale-invariant 
sense (e.g., by scaling the neighborhood to have radius 1 and 
computing the error of approximation by best-fit flat then) 
until points belonging to other flats enter the neighborhood. 



To be more precise, we define the scale-invariant error for a 
neighborhood Af = 5(xq, r) of Xo by the formula: 

&CA0 = )LJL ^i n > W 

max xe jv' ||x - Xo|| 

where \Af\ denotes the number of points in Af, Pj, denotes 
the projection onto the flat L and the minimization is over 
all d-flats in R D . We note that the numerator is the approxi- 
mation error by best-fit £2 fat at scale r and the denominator 
is the scale r. The notion of scale-invariant error was intro- 
duced and utilized in 1117111011221 . 

Using this scale-invariant error we can reformulate our 
criterion for choosing the optimal neighborhood more pre- 
cisely. That is, we start with the smallest neighborhood con- 
taining S nearest neighbors of x, increase the number of 
nearest neighbors by T in each iteration and check the fa 
number of each neighborhood using (Q}. We estimate the op- 
timal neighborhood as the last one for which fa is smaller 
than fa of the previous neighborhood (that is, we search for 
the first local minimizer of fa(Af)). This procedure is sum- 
marized in Algorithm Q] It is experimentally robust to out- 
liers if x is an inlier, since the nearest neighbors of inliers 
also tend to be inliers. 



Algorithm 1 Neighborhood size selection for HLM by ran- 

domized local best-fit flats 

Input: X = {xi,X2, • • • ,xjv} C R D : data, x: a point in 

X, S: start size, T: step size 
Output: A/"(x): a neighborhood of x. 

Steps: 

• k = -1 
repeat 

• k:=k+l 

• Let A4 be the set of the S + kT nearest points in X 
to x 

• Compute fa(k) := fa(Afk) according to (fl} 
until k> 1 and fa(k - 1) < min{fa(k - 2),fa(k)} 

• Output jV(x) :=Afk-i 



2.1.1 Theoretical justification 

The following theorem tries to justify our strategy of es- 
timating the optimal scale around each point by showing 
that in the continuous setting the first local minimizer of 
fa(x, r) :— fa(B(x, r)) is approximately the distance from 
x to the nearest cluster that does not contain x (here the un- 
derlying model is a mixture of Lebesgue measures in strips 
around several subspaces and x is an arbitrary point on one 
of these subspaces). Therefore, if we choose the size of neigh- 
borhood following Algorithm Q] (adapted to the continuous 
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setting), then we will approximately obtain the optimal neigh- 
borhood. It is rather standard to extend such estimates for 
measures to a probabilistic setting, where i.i.d. data is sam- 
pled from the continuous distribution. The theorem will then 
hold with high probability for sufficiently large sample size 
(due to technicalities, which also require truncating the sup- 
port of our continuous measure we avoid these details). The 
proof of this theorem is in the Appendix. 

Our theorem uses the following analog of the discrete #2 
of (Q~|) for a measure /1 and a ball -B(x, r) (see also 1221 '): 

&(x,r) = 
1 



mm ,1 I dist(x,L) 2 du/u(B(x,r)), (2) 

T <i-flats LCI" U JB(x.r) 

where throughout the paper dist(-, •) denotes the Euclidean 
distance, for example in this case dist(x, L) := |x — -Plx||. 
The theorem also assumes that the underlying measure is 
supported on a union of K tubes T(Li, Wi) := {x e M. D : 
|x — -Pl,x|| < Wi}, i — 1, . . .,K (centered around the 
d-flats Li,..., Lk respectively). 

Theorem 1 Let K > 2, d < D , {L t }f =1 be K d-flats in 
M. D , be K Lebesgue measures on tubes {T(Li, w)}fL 1 

C M. D respectively and let p = 2~2^=i l 1 ^ For fixed 1 < i* < 
K and fixed x* £ L^, let 



\JVk I x D data matrix representing the | A4 1 points, which 
requires a complexity of 0(d ■ D ■ |A4|)- To find J\f(x), we 
need to generate /^(A/fc) for any |A4| = S + kT, where 
k = 1,2, ■ ■ ■ , (N — S)/T, hence the complexity for obtain- 
ing J\f(x) is of order: 

(N-S)/T 

0(d-D- (S + kT)) <0(d- D ■ N 2 /2T). 

We thus note that if T is in the order of N, e.g., T = max 
(A7300, 2 )' the tota l complexity of Algorithm [T] is 0{{d ■ 
D + log N) ■ N). Note that if we limit the number of scales 
that we search, then the two log terms above can be replaced 
by a constant. 



2.2 The LBF algorithm 

The LBF algorithm searches for a good set of flats from the 
candidates (described above) in a greedy fashion. A measure 
of goodness of a K tuple of flats G is chosen; here, it will be 
the average l\ distance of each point to its nearest flat, i.e., 



r = r (x*) = dist (x*, U^i»T(Lj, w)) . 



(3) 



If w < tq, then ^(x*, r) (as a function of r) is constant 
on [0, w] and decreases on [w, ro]. If also 



lin (0.02, 



< 



min 0.02, 



50v / 2(£>-l)-ff / ' 
(D-d+2) \ 
6(50) 3 (D-d)K I 



when d = 1; 



when d > 1, 



then there exists tq < r* < 1.09 ro such that 
/3 2 (xV*) > ^ 2 (x*,r ). 



(4) 



(5) 



That is, the first local minimum of ^{x* , r) (as a function 
ofr) occurs in (ro, 1.09 ro). 

The proof of this theorem indicates a weaker condition than 
which is less intuitive. It also shows that r* — > ro as w/ro — > 
and clarifies by example why the first local minimum of 
/?2 (x* , ro) is often bigger than ro (see Remark 1). 

2.1.2 The complexity of Algorithm\l\ 

Algorithm Q] requires sorting the neighbors of x according 
to their distance to x; the computational cost of this prepro- 
cessing step is 0(D ■ N + N ■ log AT). In order to obtain 
/^(■A/fe), we need to obtain the top d singular values of the 



G = G X ({L 1 ,--- ,L K }) 



^2 dist ( x > ^?=i L i 

xGX 



(6) 



After randomly initializing K flats from the list of candi- 
dates, p passes are made through the data points. In each of 
the passes, we replace a random current flat with the candi- 
date that minimizers the value of G. We then move to the 
next pass, picking a random flat, etc. Algorithm [2] sketches 
this procedure (where the greedy minimization of G is de- 
scribed in step |5]l . 

The simplest choice of G is the sum of the squared dis- 
tances of each point in X to its nearest flat, i.e., having the 
power 2 in ©. However, in some scenarios the 1% energy 
of © is more robust to outliers than the mean squared er- 
ror (see [23 24 1 for theoretical support and [44] for experi- 
mental support). This method also allows using energy func- 
tions, which are hard to minimize (even heuristically). In- 
deed, it only requires evaluating the energy on the candi- 
date configurations. For example, when the data set requires 
stronger robustness to outliers, one may use the following 
energy: 

G' = G' X {{L X , ■■■ , L K }) = Mcdian xex dist (x, uf =1 L,) . 

The LBF algorithm is closely related to RANSAC, since 
both of them use candidate subspaces to fit the data set. 
However Algorithm \T\ gives LBF an advantage in choosing 
good candidates, while RANSAC fits a d-flat by arbitrarily 
chosen d + 1 points. 
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Algorithm 2 LBF: energy minimization over randomized 

local best-fit flats 

Input: X = {xi,X2, • • • , xjy} C R D : data, d: dimension 
of subspaces, C: number of candidate planes, K: num- 
ber of output flats/clusters, p: number of passes, S and T: 
parameters for local scale calculation. 
Output: A partition of X into K disjoint clusters {Xi}fL 1 , 
each approximated by a d-dimensional flat. 
Steps: 

1 . Pick C random points in X 

2. For each of the C points find appropriate local scale 
using AlgorithmQ] 

3. Generate a set C containing C candidate flats 
L\,...,Lc from the best fit flats to the neighbor- 
hoods from the previous step 

4. Pick a random subset of K flats C C C 

5. for j = 1 to p do 

• Pick a random flat L* E C, and find L = 
ar gm in ie£ G x ({A{L*}}u{L}) 

• Update L = \L \ {£*}} U {L} 

6. end for 

7. Partition X by sending points to nearest flats in £ 



2.2.1 The complexity and storage of Algorithm^ 

For step [2] of this algorithm we need to run AlgorithmQ] C 
times and thus its complexity is of order 0{{d- D + log N) ■ 
C ■ N). Note that the log N comes from a full sort of N dis- 
tances, and if we restrict to a fixed number of scales, this can 
be replaced by a constant. Step[3]of Algorithmic requires C 
SVD decompositions for C matrices of size at most N x D, 
in order to obtain the first d vectors in R D . It thus also has a 
complexity at most 0(C ■ d ■ D ■ N). 

Step[5]of Algorithm[2]requires the evaluation of the N x 
C matrix representing the distances |jaj* — Pl } Xi \ | between 
X = {x\,X2, ■ ■ ■ ,£_/v}and-Li,Li, • • • , Lc- This costs 0(C 
d ■ D ■ N) operations, since each distance from a point to a 
subspace costs 0(d ■ D). Moreover, the p passes have com- 
plexity of order 0(p ■ (C — K) ■ N). Therefore, step|5]of Al- 
gorithm|2]has a complexity of order 0(C ■ N ■ (d ■ D + p)). 
At last, Step [7] of Algorithm [2] has a complexity of order 
0(K ■ d ■ D ■ N), which comes from the construction of 
the N x K matrix of distances from N points to K sub- 
spaces. Combining these complexities together, we have an 
overall complexity of 0(C ■ N ■ (d ■ D + p + log N)) for 
the LBF Algorithm; as before, if we fix the number of scales 
independently from N, the log terms can be replaced by a 
constant. 

To compute the storage requirements of LBF, we note 
that the data set is saved in an N x D matrix, the candi- 
date subspaces are organized in C projection matrices of 



size D x d and in addition the algorithm stores an N x C 
matrix of distances between the data points and the C candi- 
date subspaces. Therefore, the storage of LBF is in the order 
of 0(D -N + C-D-d + N-C). 

2.3 The SLBF algorithm 

The SLBF algorithm (which is sketched in Algorithm [3) 
processes the candidate subspaces via a spectral clustering 
method. It first finds the neighborhoods {Mi }fLi f° r a U points 
{Xilili via AlgorithmCQand fits d-flats {Li}f =l (viaPCA) 
in these neighborhoods. It then forms the N x N matrices S 
and S as follows: 



Sij = Jdist(xi,Lj)dist(xj,Li), (7) 
and 

Si tj = exp(-S lJ /2 ( 7|) + exp(-S lJ /2a 4 2 ), (8) 
where 

a, = A /mm^llx-Prf/IAGI (9) 

(we explain the choice of A below, when we clarify (|9]i)- Fi- 
nally, it applies spectral clustering with the matrix S. More 
precisely, SLBF follows the main algorithm of ll29l Section 
2], replacing the matrix A there by S, multiplying the unit 
eigenvectors of Step 3 (of ll29l Section 2]) by the corre- 
sponding square roots of eigenvalues and skipping step 4. 
We remark that the two last changes to [29, Section 2] are 
commonly used so that the similarity matrix S can be con- 
sidered as a Gram matrix, see e.g., Euclidean MDS [9| and 
ISOMAPEa. 

As discussed in l37l . SLBF is a "spectral clustering- 
based method", similar to SCC, LSA and SSC. These al- 
gorithms construct an N x N affinity matrix, whose ij-th 
entry represents the similarity between points i and j, and 
then apply spectral clustering using this affinity matrix. Ide- 
ally, the affinities of points from the same cluster are of order 
1 and the affinities of points from different clusters are of or- 
der 0. Indeed, for the affinity S of SLBF, if Xj and Xj are in 
the same cluster, then we expect that X; is close to Lj and 
Xj is close to Li, which means Si j is close to and thus 
Sij is close to 1 (we assume here that Li and Lj are good 
estimators for the underlying subspace of the cluster shared 
by Xi and Xj as suggested by TheoremQ}. Otherwise, if x^ 
and Xj are not in the same cluster, then we expect that X; 
is sufficiently far from Lj and x 3 is sufficiently far from 
Li, which implies that Si j is close to 0. The choice of <7j 
clearly affects this heuristic argument on the size of S%j ■ 
Theoretically Oj should be larger than the noise, such that 
Si.j is close to 1 when x, and Xj are in the same cluster, 
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Algorithm 3 SLBF: spectral clustering based on best-fit 

flats 

Input: X = {xi, X2, • • • ,xjy} C R D : data, A: a parameter 
(or several parameters if we use step 7, with default val- 
ues [2, 2e, 2e 2 , ■ ■ ■ , 2e 6 ]), other parameters used by Algo- 
rithm[T] 

Output: A partition of X into K disjoint clusters {X;}^, 
each approximated by a single flat. 
Steps: 

1 . For each point x«, fit a subspace Li by AlgorithmQ] 

2. Construct the TV x TV matrix S and S by Q, © and 

3. Let D be the TV x TV diagonal matrix, such that 

Dm = Ef=X 

4. Normalize S by : S = D~ 5 SD~ 5 

5. Let U be the N x K matrix whose columns are the 
top K eigenvectors of S, and S be the K xK matrix 
representing the top K eigenvalues of S 

6. Apply TV-means to the rows of TV x K matrix 
U £ 1 and partition X accordingly 

7. Repeat steps 2-6 with the default values of A (see in- 
put) to obtain several segmentations and choose the 
segmentation minimizing the error: 

V min ( V dist 2 (x,Li) ] (10) 



but <jj cannot be too large so that Si.j is close to 1 when 
and Xj are not in the same cluster. Therefore we use 
where <J min d _ flats L Y^eM } jjx - P L x\\ 2 /\Mj | is the esti- 
mated noise of the data set around the point Xj and A is a 
parameter. Following the strategy in Q, we choose different 
values of A (our fixed default values are [2, 2e, 2e 2 , • • • , 2e 6 ]) 
and consequently obtain several segmentation results (7 re- 
sults when using our default values). We then choose the 
segmentation with the smallest error in dlOt . 

We remark that SLBF is robust to outliers. Indeed, the 
original data points are embedded as the rows of U^ 3 (see 
step 6 of Algorithm |3]l and have magnitude smaller than 1 . 
Therefore the subsequent application of TV-means does not 
suffer from points of arbitrarily large values. To verify that 
the magnitude of the embedded points is smaller than 1 , we 
note that the diagonal elements in S are smaller than 1. Since 
U£U T < S the diagonal elements of VE\J T are also 
smaller than 1. Therefore, the norm of the rows of XJE? 
are also smaller than 1 . 

Similar to SLBF, LSA (40) is also based on fitting local 
subspaces. However, LSA fits subspace by local neighbor- 
hoods of fixed number of points and is not adaptive. More- 
over, the local subspaces of LSA are forced to be linear 
(since the affinity of LSA is based on principal angles be- 



tween such subspaces) and this further restricts the applica- 
bility of LSA. There is also some similarity between the idea 
of SLBF and that of SCC 07). Indeed, we may view SCC 
as fitting candidate subspaces based on d+ 1 data points (the 
iterative procedure tries to enforce the points to be from the 
same cluster). However, in practice they operate very differ- 
ently, in particular, SCC is not based just on local informa- 
tion (though a local version of SCC follows from [ 1 1). The 
SSC algorithm is also a spectral method, but similar to SCC 
its affinities are global (they are based on sparse representa- 
tion of data points). 

2.3.1 Complexity and storage of the SLBF algorithm 

Step[T]of Algorithm[3]has a complexity of order 0((d ■ D + 
log TV) ■ TV 2 ), since it applies Algorithm Q] to every point 
in the set X. The most expensive calculation of steps |2E] 
in Algorithm [3] is the construction of S, which requires a 
complexity of order 0(d ■ D ■ TV 2 ). The eigenvalue decom- 
position in step[5]has a complexity of order 0(K ■ TV 2 ) and 
the K -means algorithm in step [6] has a complexity of order 
0(n s ■ TV • TV 2 ), where n s is the iterations in TV-means. 

Combining these complexities together, we have an over- 
all complexity of order 0( (d • TJ + log TV) • TV 2 + n s • TV • TV 2 ) 
for SLBF. As before, limiting to a constant number of scales 
replaces the log term with a constant. 

We note that SLBF stores the data set in a D x TV matrix, 
the candidate subspaces in TV D x d matrices (recall that 
in SLBF every data point is assigned a subspace and thus 
C = TV) and it also uses the TV x TV matrix S. Therefore, the 
storage of SLBF is in the order of 0{N ■ D ■ d + TV 2 ). 

2.4 Adaptation of the proposed algorithms to motion 
segmentation data 

Note that the first minimum in the Theorem[T]excludes the 
left endpoint, and thus k = is excluded in Algorithm[T]i. In 
our experiments, we noticed that on data without too much 
noise, it is useful to allow the first scale to count as a lo- 
cal minimum and allow k = in Algorithm Q]). We refer 
to the implementation of LBF and SLBF with those two 
techniques tailored for motion segmentation data as LBF- 
MS and SLBF-MS. 

3 Experimental results 

In this section, we conduct experiments on artificial and real 
data sets to verify the effectiveness of the proposed algo- 
rithm in comparison to other HLM algorithms. We will see 
that in many situations, the methods we propose are fast 
and accurate; however, in Section [331 we will show a failure 
mode of our method, and discuss how this can be corrected. 
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Table 1 Mean percentage of misclassified points in artificial data for linear-subspace cases or affine-subspace case. 





Linear 


2 2 e r 4 


4 2 6 R 6 


2 4 G R 4 


10 2 6 R 15 


(4,5,6) 

e R 10 


(1,5) 

e R 6 


Outl. % 





30 





30 





30 





30 





30 





30 




1 e% 
t(s) 


2.6 
1.1 


6.9 
0.8 


0.0 

1.0 


2.6 
1.8 


0.1 

1.5 


22.4 
2.0 


0.5 

13.3 


3.8 
5.7 


1.8 
5.1 


28.2 
8.4 


N/A 
N/A 


34.6 
1.9 


LSCC-MS 


e% 
t(s) 


2.7 
1.1 


10.0 

0.5 


0.0 

1.1 


4.1 
l 1A 


0.1 

1.7 


36.7 
1.5 


0.7 
5.1 


31.9 
5.6 


1.4 
4.0 


19.8 
4.6 


N/A 
N/A 


32.9 
2.0 


LSA 


e% 
t(a) 


18.4 

6.8 


19.6 
16.0 


0.1 

7.1 


12.7 
20.8 


0.4 

23.8 


21.0 
54.4 


0.1 
11.7 


9.9 
31.5 


5.9 
20.1 


6.6 
54.4 


27.4 
6.6 


35.4 
13.8 


KF 


e% 
t(s) 


2.5 

0.5 


15.8 
0.6 


2.5 
0.2 


18.4 
0.8 


0.1 

0.7 


34.3 
1.8 


0.0 

0.4 


33.8 
1.0 


1.0 
0.7 


30.6 
2.8 


20.2 
0.3 


23.5 
0.5 


MoPPCA 


e% 
t(s) 


2.5 

0.3 


14.2 
0.5 


0.0 

0.2 


17.7 
0.7 


0.1 

0.7 


34.2 
2.0 


0.0 
0.2 


38.8 
1.1 


1.6 
1.1 


34.7 
3.3 


23.4 
0.5 


24.0 
0.5 


GPCA 


e% 
t(s) 


6.0 
2.1 


2.5 

38.0 


0.0 

1.9 


2.0 

85.2 


0.1 

10.8 


6.3 

136.2 


0.0 

11.2 


14.6 
546.0 


14.6 

73.8 


N/A 
N/A 


5.9 
0.7 


N/A 
N/A 




e% 
t(s) 


2.8 
0.6 


3.7 
0.5 


0.0 

0.5 


2.3 
0.5 


0.1 

1.8 


11.5 

2.7 


0.0 

0.6 


1.9 

0.8 


1.5 
1.1 


1.5 

1.4 


18.8 
0.5 


14.1 
0.5 


LBF-MS 


e% 
t(s) 


2.7 
0.6 


3.0 
0.5 


0.0 

0.4 


2.6 
0.5 


0.1 

1.7 


11.7 
2.6 


0.0 

0.4 


2.2 
0.6 


1.3 
0.9 


1.5 
1.3 


19.5 
0.4 


13.7 
0.4 


SLBF 


e% 
t(s) 


5.2 
11.2 


6.3 
20.7 


0.1 
9.4 


7.0 
21.7 


0.1 

65.0 


23.9 
174.9 


0.0 

9.5 


6.2 
23.3 


2.0 
23.2 


2.4 
64.2 


11.1 

9.3 


13.5 
15.3 


SLBF-MS 


e% 
t(s) 


7.8 
12.0 


11.7 
24.0 


0.1 
8.8 


6.6 

24.4 


0.2 
68.1 


46.6 
202.0 


0.0 

8.4 


4.8 
23.5 


1.9 
22.0 


2.6 
72.4 


19.7 
9.8 


22.1 
16.3 


RANSAC (oracle) 


e% 
t(s) 


2.7 
0.1 


2.6 
0.1 


2.9 
0.1 


2.1 
0.2 


8.0 
0.1 


9.4 
0.2 


0.5 
5.9 


5.8 
6.7 


1.7 
1.5 


1.5 
7.1 


N/A 
N/A 


31.6 
0.2 


RANSAC (£ from LBF) 


e% 
t(s) 


3.2 
0.1 


2.6 
0.1 


2.1 
0.1 


2.4 
0.2 


7.7 
0.1 


9.8 
0.2 


0.4 
5.9 


6.7 
6.7 


1.8 
1.5 


1.5 

7.0 


N/A 
N/A 


30.6 
0.3 


ALC (oracle) 


e% 
t(s) 


4.1 
7.3 


3.4 
23.2 


0.1 
7.7 


16.3 
33.6 


0.1 

28.4 


30.1 
136.3 


50.0 
13.9 


50.0 
172.6 


5.4 
23.0 


36.1 
180.1 


0.3 

7.8 


0.4 
17.3 


ALC (£ from LBF) 


e% 
t(s) 


4.5 
8.0 


5.7 
28.0 


0.1 
8.1 


10.0 

37.9 


0.1 

29.6 


14.0 
121.9 


50.0 
16.6 


50.0 
152.4 


2.5 
24.0 


1.8 
151.6 


0.4 
8.3 


0.3 

18.1 


SSC 


e% 


19.5 
114.8 


34.3 


0.2 


43.5 


0.4 


52.8 


47.0 


44.9 
276.6 


11.5 
185.5 


54.0 
437.9 


9.4 
94.1 


15.9 
142.1 


Linear 


2 2 e r 4 


4 2 e r 6 


2 4 6 R 4 


10 2 6 R 15 


(4,5,6) 

eR 10 


(1,5) 
G R 6 


Outl. % 





30 





30 





30 





30 





30 





30 


sec 


e% 
t(s) 


0.0 

1.2 


0.6 
1.0 


0.0 

1.1 


0.0 

2.0 


0.0 

1.4 


0.5 

2.5 


0.0 

6.1 


0.7 
13.7 


0.0 

5.6 


5.8 
6.0 


N/A 
N/A 


N/A 
N/A 


SCC-MS 


e% 
t(s) 


0.0 

1.2 


2.2 
0.7 


0.0 

1.2 


0.5 
1.6 


0.0 

1.7 


5.8 
2.2 


0.0 

5.4 


0.0 

6.0 


0.0 

4.6 


3.1 
4.8 


N/A 
N/A 


N/A 
N/A 


LSA 


e% 
t(s) 


0.1 

6.7 


11.0 
16.1 


0.0 

7.1 


4.7 
20.8 


0.4 
22.2 


41.7 
54.0 


0.0 

11.7 


0.0 

32.2 


0.0 

38.3 


1.1 
54.0 


37.5 
6.6 


37.9 
13.9 


KF 


e% 
t(s) 


0.2 
0.8 


15.1 
0.6 


0.1 
0.4 


26.0 
0.7 


0.3 
1.0 


37.1 
1.4 


0.0 

0.6 


24.9 
1.7 


0.0 
1.0 


23.5 
1.4 


24.8 
0.5 


27.1 
0.5 


MoPPCA 


e% 
t(s) 


0.2 
0.9 


23.7 
0.5 


0.1 
0.5 


38.3 
0.6 


0.5 
1.1 


39.8 
1.4 


0.0 

0.9 


45.2 
1.9 


0.0 

1.9 


46.8 
2.0 


30.8 
0.5 


30.4 
0.5 


GPCA 


e% 
t(s) 


0.2 
1.8 


18.4 
43.7 


0.2 
3.3 


22.2 
104.0 


0.4 

8.3 


38.1 
209.3 


0.0 

11.8 


27.9 
501.1 


0.3 
69.1 


N/A 
N/A 


N/A 
N/A 


N/A 
N/A 


LBF 


e% 
t( S ) 


0.0 

0.7 


2.0 
0.6 


0.0 

0.5 


0.7 
0.6 


0.0 

1.9 


4.5 

2.8 


0.0 

0.6 


0.3 
0.8 


0.0 

1.2 


0.0 

1.5 


4.7 
0.4 


11.2 
0.5 


LBF-MS 


e% 
t(s) 


0.0 

0.6 


2.7 
0.5 


0.0 

0.4 


1.5 
0.5 


0.0 

1.7 


5.2 
2.7 


0.0 
0.4 


0.5 
0.6 


0.0 
1.0 


0.0 
1.3 


3.9 
0.4 


10.5 
0.4 




e% 
t( S ) 


0.0 

9.3 


1.0 
19.1 


0.0 

5.8 


0.0 

19.0 


0.0 

37.7 


0.1 

143.1 


0.0 

6.3 


0.0 

19.4 


0.0 

35.1 


0.0 

61.4 


0.0 

5.9 


0.0 

14.8 


SLBF-MS 


e% 
t(s) 


0.0 

8.8 


0.1 

21.7 


0.0 

5.6 


0.0 

21.9 


0.0 

38.0 


0.1 

175.5 


0.0 

5.9 


0.0 

21.1 


0.0 

40.1 


0.0 

66.7 


0.0 

5.9 


0.0 

14.3 


RANSAC (oracle) 


e% 
t(s) 


13.8 
0.1 


11.6 
0.2 


9.8 
0.4 


9.6 
1.8 


30.9 
0.4 


27.0 
0.8 


1.9 
6.4 


8.3 
6.8 


1.2 
3.7 


3.4 
7.4 


N/A 
N/A 


23.6 
0.5 


RANSAC (e from LBF) 


e% 
t(s) 


13.6 
0.1 


11.6 
0.2 


11.6 
0.4 


10.4 
1.9 


29.9 
0.4 


28.5 
0.8 


1.4 
6.4 


9.6 
6.7 


1.2 
3.7 


2.4 
7.4 


N/A 
N/A 


23.1 
0.5 


ALC (oracle) 


e% 
t( S ) 


0.0 

17.6 


0.0 

25.2 


0.0 

16.6 


0.0 

39.1 


0.0 

64.2 


25.1 
119.3 


0.0 

20.0 


40.0 
43.0 


0.0 

39.7 


65.0 
92.7 


0.0 

18.3 


0.0 

36.8 


ALC (e from LBF) 


e% 
t(s) 


0.0 

18.7 


0.4 

26.8 


0.0 

17.2 


0.0 

29.8 


0.0 

65.2 


0.3 
113.6 


0.0 

24.4 


0.0 

55.5 


0.0 

47.9 


0.0 

85.2 


0.0 

18.8 


0.0 

38.9 




e% 


0.0 


1.9 


0.0 


0.1 


0.1 


6.4 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 
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We measure the accuracy of those algorithms by the rate 
of misclassified points with outliers excluded, that is 



error% 



# of misclassified inliers 
# of total inliers 



x 100% . 



(11) 



In all the experiments below, the number C in Algo- 
rithm |2] is 70 • K, where K is the number of subspaces, 
the number p in Algorithm [2] is 5 • K , and the numbers S 
and T in Algorithm[TJare 2 • d and 2 respectively, where d is 
the dimension of the subspace. According to our experience, 
LBF and SLBF are very robust to changes in parameters, 
but unsurprisingly, there is a general trade off between ac- 
curacy (higher C, higher p, smaller T), and run time (lower 
C, lower p, larger T). We have chosen these parameters for 
a balance between run time and accuracy. Nevertheless, we 
have insisted to use the same parameters for all data sets and 
experiments, even though particular parameters could obtain 
even better or near perfect results for some of the data sets. 
The experiments in Sections [3.11 and [3~2l run on a computer 
with Intel Core 2 CPU at 2.66GHz and 2 GB memory, and 
experiments in Sections [3.31 and [3~4l run on a machine with 
Intel Core 2 Quad Q6600 at 2.4GHz and 8 GB memory. 



3.1 Clustering results on artificial data 

We compare our algorithms with the following algorithms: 
Mixtures of PPCA (MoPPCA) O, if -flats (KF) Q6), Lo- 
cal Subspace Analysis (LSA) [40|, Spectral Curvature Clus- 
tering (SCC) 0, Random Sample Consensus (RANSAC) 
for HLM EH, Agglomerative Lossy Compression (ALC) J27] 
and GPC A with voting/robust GPCA (GPCA) M28II411 . Through 
out the rest of the paper, we use the Matlab codes of the 
GPCA, MoPPCA and KF algorithms from http://perception.csl. umc.^edu/gpca 
the LSA algorithm from http://www. The artificial data represents various instances of K lin- 

vision.jhu.edu/db, the SCC algorithm from |hrtp://www.mathlim i^id^b#SOianife'RP . If their dimensions are fixed and equal 
the ALC algorithm from http:// 

perception.csl.uiuc.edu/coding/motion/, the RANSAC algo- 
rithm from http://www.vision.jhu.edu/code/and the SSC al- 
gorithm from http://www.cis.jhu. edu/^ehsan/ssc. htm. 

For the SCC algorithm, we also try a slightly modified 
version tailored for motion segmentation as in step [6] of Al- 
gorithm [3] which we refer to as SCC-MS (SCC for motion 
segmentation): Following the notation of [Algorithm 2]J7] 
we let the matrix U be the N x K matrix whose columns 
are the top K left singular vectors of A* c and also denote 
by S the diagonal K x K matrix whose elements are the 
top K left singular values of A^. Then the if -means step of 
SCC-MS is applied directly to the rows of the N x K ma- 

1/9 

trix US' (as opposed to applying it to U (or its row-wise 
normalization by 1) in SCC). 

The MoPPCA algorithm is always initialized with a ran- 
dom guess of the membership of the data points. The LSCC 



algorithm is initialized by randomly picking 100 x K (d+1)- 
tuples (following [7|) and KF is initialized with a random 
guess. Since algorithms like KF tend to converge to local 
minimum, we use 10 restarts for MoPPCA, 30 restarts for 
KF, and recorded the misclassification rate of the one with 
the smallest £2 error for both of these algorithms. The num- 
ber of restarts was restricted by the running time and accu- 
racy. In SSC algorithm, we set the value A to be 0.01, as 
suggested in the code. 

The RANSAC for HLM and ALC algorithms l42ll27l 
depend on a user supplied inlier threshold. RANSAC (ora- 
cle) and ALC (oracle) use the oracle inlier bound given by 
the true noise variance of the model and thus clearly have 
an advantage over the other algorithms listed. RANSAC (e 
from LBF) and ALC (e from LBF) estimate the inlier thresh- 
old by the local best-fit flats heuristic of this paper. That is, 
they fit best-fit neighborhoods for all N points using the lat- 
ter heuristic and estimate the least error of approximation by 
d-flats in these N neighborhoods. The inlier bound e is then 
the average of these errors. If the number of clusters result- 
ing from ALC (e from LBF or oracle) is larger than K, then 
we choose the K largest clusters and identify the points in 
the rest of clusters as outliers. For some cases the RANSAC 
algorithm breaks down and we then report it as N/A. The 
reason for this is that RANSAC (for HLM) 021 is very sen- 
sitive to the estimate of e and an overestimate can result in 
removal of points belonging to more than one subspace, so 
that the algorithm may exhaust all points before detecting K 
subspaces. 

We remark that GPCA cannot naturally deal with out- 
liers, therefore we use robust GPCA with Multivariate Trim- 
ming [41 1 and the parameters 'angleTolerance' and 'bound- 
are set to be 0.3 and 0.4 respectively. 



If 



d, we follow [7 1 and refer to the setting as d <E 
they are mixed, then we follow [28] and refer to the setting 
as (d%, . . . , dx) G IR 15 . Fixing K and d (or d%, . . ■ , dx), we 
randomly generate 100 different instances of corresponding 



hybrid linear models according to the code in http://perception.csl.uiuc.edu/ 
More precisely, for each of the 100 experiments, K lin- 
ear subspaces of the corresponding dimensions in M. D are 
randomly generated. The random variables sampled within 
each subspace are sums of two other variables. One of them 
is sampled from a uniform distribution in a d-dimensional 
ball of radius 1 of that subspace (centered at the origin for 
the case of linear subspaces). The other is sampled from a 
£>-dimensional multivariate normal distribution with mean 
and covariance matrix 0.05 2 • Idxd- Then, for each sub- 
space 250 samples are generated according to the distribu- 
tion just described. Next, the data is further corrupted with 
5% or 30% uniformly distributed outliers in a cube of side- 
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Table 2 The mean and median percentage of misclassified points for two-motions and three-motions in Hopkins 155 database. 



2-motion 


Checker 


Traffic 


Articulated 


All 


Mean 


Median 


Mean 


Median 


Mean 


Median 


Mean 


Median 


GPCA 


6.09 


1.03 


1.41 


0.00 


2.88 


0.00 


4.59 


0.38 


LLMC 5 


4.37 


0.00 


0.84 


0.00 


6.16 


1.37 


3.62 


0.00 


LSA AK 


2.57 


0.27 


5.43 


1.48 


4.10 


1.22 


3.45 


0.59 


LBF(4if,3) 


3.65 


0.00 


3.89 


0.00 


4.43 


0.15 


3.78 


0.00 


LBF-MS(4i\T,3) 


2.90 


0.00 


1.64 


0.00 


2.51 


0.06 


2.54 


0.00 


SLBF(2F,3) 


1.59 


0.00 


0.20 


0.00 


0.80 


0.00 


1.16 


0.00 


SLBF-MS(2F,3) 


1.28 


0.00 


0.21 


0.00 


0.94 


0.00 


0.98 


0.00 


SCC(4K,3) 


2.42 


0.00 


4.44 


0.00 


9.51 


2.04 


3.60 


0.00 


SCC-MS(4X,3) 


2.00 


0.00 


0.35 


0.00 


4.11 


1.12 


1.77 


0.00 


SSC-N(4if ,3) 


1.29 


0.00 


0.29 


0.00 


0.97 


0.00 


1.00 


0.00 


MSL 


4.46 


0.00 


2.23 


0.00 


7.23 


0.00 


4.14 


0.00 


RANSAC 


6.52 


1.75 


2.55 


0.21 


7.25 


2.64 


5.56 


1.18 


3 -motion 


Checker 


Traffic 


Articulated 


All 


Mean 


Median 


Mean 


Median 


Mean 


Median 


Mean 


Median 


GPCA 


31.95 


32.93 


19.83 


19.55 


16.85 


28.66 


28.66 


28.26 


LLMC AK 


12.01 


9.22 


7.79 


5.47 


9.38 


9.38 


11.02 


6.81 


LLMC 5 


10.70 


9.21 


2.91 


0.00 


5.60 


5.60 


8.85 


3.19 


LSA AK 


5.80 


1.77 


25.07 


23.79 


7.25 


7.25 


9.73 


2.33 


LSA 5 


30.37 


31.98 


27.02 


34.01 


23.11 


23.11 


29.28 


31.63 


LBF(4X,3) 


8.50 


1.26 


16.31 


13.52 


20.75 


20.75 


10.77 


1.70 


LBF-MS(4if,3) 


6.97 


1.15 


7.06 


0.62 


21.38 


21.38 


7.81 


0.98 


SLBF(2F,3) 


4.57 


0.94 


0.38 


0.00 


2.66 


2.66 


3.63 


0.64 


SLBF-MS(2F,3) 


3.33 


0.39 


0.24 


0.00 


2.13 


2.13 


2.64 


0.22 


SCC(4_R",3) 


7.80 


1.04 


8.05 


2.37 


7.07 


7.07 


7.81 


0.67 


SCC-MS(4F,3) 


6.28 


0.80 


4.09 


0.58 


7.22 


7.22 


5.89 


0.68 


SSC-N(4if,3) 


3.22 


0.29 


0.53 


0.00 


2.13 


2.13 


2.62 


0.22 


MSL 


10.38 


4.61 


1.80 


0.00 


2.71 


2.71 


8.23 


1.76 


RANSAC 


25.78 


26.01 


12.83 


11.45 


21.38 


21.38 


22.94 


22.03 



Table 3 The standard deviation to the mean and median percentage of misclassified points for two-motions and three-motions in Hopkins 155 
database. 



2-motion 


Checker 


Traffic 


Articulated 


All 


Mean 


Median 


Mean 


Median 


Mean 


Median 


Mean 


Median 


LBF(4if,3) 


0.71 


0.00 


1.22 


0.00 


1.04 


0.66 


0.50 


0.00 


LBF-MS(4if,3) 


0.53 


0.00 


1.06 


0.00 


1.14 


0.28 


0.47 


0.00 


WLBF(4if,3) 


0.53 


0.00 


0.98 


0.00 


1.35 


0.00 


0.47 


0.00 


SLBF-MS(4if,3) 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


SCC(4if,3) 


0.27 


0.00 


1.51 


0.00 


2.34 


1.52 


0.38 


0.00 


SCC-MS(4_fS",3) 


0.33 


0.00 


0.25 


0.00 


1.03 


0.46 


0.25 


0.00 


SSC-N(4/f,3) 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


3-motion 


Checker 


Traffic 


Articulated 


All 


Mean 


Median 


Mean 


Median 


Mean 


Median 


Mean 


Median 


LBF(4if,3) 


1.52 


0.58 


3.71 


9.69 


7.37 


7.37 


1.43 


0.65 


LBF-MS(4F,3) 


1.48 


0.45 


3.81 


2.35 


6.59 


6.59 


1.42 


0.40 


SLBF(4_fs:,3) 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


SLBF-MS(4if,3) 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


SCC(4if,3) 


1.20 


0.53 


5.70 


7.00 


1.77 


1.77 


1.43 


0.49 


SCC-MS(4_fS",3) 


0.94 


0.50 


3.25 


0.54 


2.54 


2.54 


0.92 


0.33 


SSC-N(4F,3) 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 



length determined by the maximal distance of the former 
250 samples to the origin (using the same code). 

Since most algorithms (SCC, LSA, MoPPCA, LBF, SLBF, 
RANSAC, SSC) do not support mixed dimensions natively, 
we assume each subspace has the maximum dimension in 
the experiment. GPCA and ALC support mixed dimensions 
natively, and GPCA is the only algorithm for which we spec- 
ify the dimensions for each subspace in mixed-dimension 
case (note that the knowledge of dimensions are unneces- 
sary in ALC algorithm). 

The mean (over 100 instances) misclassification rates 
and the mean running time of the various algorithms are 
recorded in Table Q] From Table Q] we can see that our algo- 
rithms, i.e., LBF, LBF-MS, SLBF, SLBF-MS, perform well 



in various artificial instances of hybrid linear modeling (with 
both linear subspaces and affine subspaces), and their ad- 
vantage is especially obvious with many outliers and affine 
subspaces. The robustness to outliers is a result of our use 
of both l\ loss function (see e.g., |23,24|) and random sam- 
pling. The SLBF and SLBF-MS are better at the affine cases 
because of their use of spectral clustering. Also unlike many 
other methods, the proposed methods natively support affine 
subspace models (the particular data has non-intersecting 
subspaces, which makes advantageous to some other algo- 
rithms, e.g., SSC). The results of RANSAC (e from LBF) 
and ALC (e from LBF) show that the local best-fit heuris- 
tic can be effectively used to estimate the main parameter of 
RANSAC and ALC, i.e., to estimate the local noise. Table 
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Fig. 2 Frames of the traffic, articulated and checker (from left to right) 
videos in Hopkins 155 database. 



of three types of videos: checker, traffic and articulated (see 
Figure |2] for demonstration of frames of such videos). 

More formally, for a given video sequence, we denote 
the number of frames by F. In each sequence, we have ei- 
ther one or two independently moving objects, and the back- 
ground can also move due to the motion of the camera. We 
let K be the number of moving objects plus the background, 
so that K is 2 or 3 (and distinguish accordingly between 
two-motions and three-motions). For each sequence, there 



are also N feature points y i , y2 , • 



that are 



detected on the objects and the background. Let £ 



Fig. 1 The misclassification rate of some algorithms for the Hopkins 
155 database. The y-axis represent the percentage of data sets that 
have misclassification rates (under corresponding algorithms) lower 
than that of x -axis. 



Q] also shows that the running time of LBF/LBF-MS is less 
than the running time of most other algorithms, especially 
GPCA, LSA, RANSAC, ALC and SSC. The difference is 
large enough that we can also use the proposed algorithm 
as an initialization for the others. LBF and LBF-MS algo- 
rithms are slower than a single run of if -flats, but it usually 
takes many restarts of if -flats to get a decent result. Notice 
that the choice of C and p in our algorithm function in a 
similar manner to the number of restarts in KF. SLBF and 
SLBF-MS cost more time when N is large, because of the 
construction of the N x N matrix in spectral clustering, but 
it still has a comparable speed to LSA and is faster than SSC, 
which are two spectral-clustering based algorithms. 



3.2 Clustering results on motion segmentation data 

We test the proposed algorithms on the Hopkins 155 database 
of motion segmentation, which is available at http://www. vision 
This data set contains 155 video sequences along with the 
coordinates of certain features extracted and tracked for each 
sequence in all its frames. The main task is to cluster the 
feature vectors (across all frames) according to the different 
moving objects and background in each video. It consists 



be the coordinates of the feature point yj in the r im- 
age frame for every 1 < i < F and 1 < j < N. Then 
Zj = [zij, Z2j, • • • ,zpj] G K 2F is the trajectory of the 
j th feature point across the F frames. The actual task of 
motion segmentation is to separate these trajectory vectors 
Zi, Z2, • • • , za? into K clusters representing the K underly- 
ing motions. 

It has been shown [8 1 that under the affine camera model, 
the trajectory vectors corresponding to different moving ob- 
jects and the background across the F image frames live 
in distinct affine subspaces of dimension at most three in 
M. 2F . Following this theory, we implement our algorithm 
with d = 3. 

We compare our algorithm with the following ones: im- 
proved GPCA for motion segmentation (GPCA) (39), K- 
flats (KF) [16] (implemented for linear subspaces), Local 
Linear Manifold Clustering (LLMC) 1151 . Local Subspace 
Analysis (LSA) (40), Multi Stage Learning (MSL) [32], Spec- 
tral Curvature Clustering (SCC) Q and SCC-MS (see de- 
scription earlier), Sparse Subspace Clustering (SSC) [11 1, 
and RANSAC for HLM ||42l . 

For GPCA (improved for motion segmentation), LLMC, 
LSA, MSL and RANSAC (for HLM), we copy the results 
from http://www.vision.jhu.edu/data/hopkinsl55 (they are 
jTMKedttDUatapShripteiiitslfe^jortedin ll35l and 02)). We perform 
our own experiments for SCC, SCC-MS, SSC-N (SSC-B is 
not reported since it did not perform as well as SSC-N), 
LBF, LBF-MS, SLBF, SLBF-MS, we perform the experi- 
ments on our own and record the mean misclassification rate 
and the median misclassification rate for each algorithm for 
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Table 4 Average total computation times for all 155 sequences. 



RANSAC 


LBF-MS (4K,3) 


LBF (4K,3) 


SCC-MS(4iv~,3) 


SLBF-MS(2F,3) 


SLBF(2F,3) 


SSC-N(4if,3) 


60s 


73s 


91s 


196s 


28min 


31min 


427min 



any fixed K (two or three-motions) and for the different type 
of motions ("checker", "traffic" and "articulated"). Each ex- 
periment (testing the latter set of algorithms) was repeated 
500 times. The average misclassification rates, standard de- 
viation and running time are recorded in Tables [5] and [6] and 
demonstrated in Figure Q] 

Our misclassification rates for SCC are different than (|6| 
and lEUl and our misclassification rates for SSC are different 
than ifTTI (the difference between our and their results differ 
more than twice the standard deviations of misclassification 
rates obtained here). This can be explained by possible evo- 
lutions of the codes since then (at least for SSC). We remark 
though that the misclassification rates of SCC-MS here are 
even slightly better than the misclassification rates of SCC 
in E). 

From Table [2] and Figure Q] we can see that our algo- 
rithms work well for the Hopkins database. Of all the meth- 
ods tested, SLBF-MS and SSC-N are the most accurate al- 
gorithms. Besides SLBF/SLBF-MS and SSC-N, only SCC- 
MS is better than LBF-MS. However, From Table gl LBF- 
MS ran more than 100 times faster than SSC-N and SLBF- 
MS is also more than 10 times faster than SSC. In many of 
the cases, the i\ energy (as well as the energy) was lower 
for the labels obtained by LBF than the true labels. We thus 
suspect that the reason SLBF/SLBF-MS works better than 
LBF/LBF-MS is that good clustering of the Hopkins data 
requires additional type of information (e.g., spectral infor- 
mation) to be combined with subspace clustering (i.e., hy- 
brid linear modeling). 

By adapting the parameters of SLBF-MS (or alterna- 
tively, SLBF, LBF, LBF-MS), we can further improve its 
misclassification rates on Hopkins 155 (e.g., total 0.81% for 
two-motions and 2.12% for three-motions by SLBF-MS). 
However, we have fixed in advance all parameters and in- 
sisted using the same parameters on all four kinds of data 
(see the third paragraph of Section[3]i. 

From Table [3] we can see that SLBF-MS, SLBF and 
SSC-N have negligible randomness. Indeed, their only ran- 
domness come from the Jf-means step, but this randomness 
is effectively reduced because of the restarting strategy. LBF 
and LBF-MS are more random, but still have comparable 
standard deviations with other good algorithms on Hopkins 
155 database such as SCC/SCC-MS. 



3.3 Clustering results on the extended Yale face database B 

We test LBF, LBF-MS, SLBF and SLBF-MS and compare 
them with ALC, if-flats, and SSC on the extended Yale face 
database B [21 1, which is available on http://vision.ucsd.edu/ 
leekc/ExtYaleDatabase/Ext YaleB.html. We will see that this 
data set shows a failure mode of our algorithms; and we will 
show how we can engineer a workaround. 

We use subsets of the extended Yale face database B 
consisting of face images of K = 2, 3, ■ ■ • , 10 persons un- 
der 64 varying lighting conditions. Our objective is to cluster 
these images according to the persons. In implementation, 
for any fixed K we repeat each algorithm on 100 randomly 
chosen subsets of K persons. The HLM model is applicable 
to this database, because the images of a face under vari- 
able lighting lies in a three-dimensional linear subspace if 
shadow is not considered 1211 . and a nine-dimensional sub- 
space with shadow considered [2|. In our experiments, we 
found that the images of a person in this database lie roughly 
in a 5-dimensional subspace, and therefore we first reduce 
the dimension of the data to hK (recall that K is the number 
of persons). We do not include the GPCA algorithm since 
it is slow and does not work well on this database. We also 
do not include SCC and RANSAC since the code provided 
returns errors for some examples. The setting of ALC (vot- 
ing with K) follows l30l sec. (2.2)] exactly: it chooses e 
from 101 values in the range 10~ 5 — 10 3 (see the code in 
http://perception.csl.uiuc.edU/coding/motion/#Software). 

We can see from the first row of Table [5] that LBF does 
a poor job discriminating the linear clusters in this data set. 
The failure occurs because of a combination of two factors: 
the first is the relatively sparse sampling of the data, with 
only 64 points per 5-dimensional cluster, and the second, 
the relative nearness of the underlying subspaces to each 
other. In particular, almost any neighborhood of any given 
point (even very small neighborhood) has points from the 
other affine clusters and consequently there is no "optimal" 
scale. For example, in the 128 face images from persons 1 
and 2, more than a fifth of the points are closer to the sub- 
space spanned by the first 5 principal components of the 
points in the other cluster than to their second nearest neigh- 
bors, and more than two thirds of the points are closer to 
the other subspace than to their 4th nearest neighbors. In 
some sense, this is a single 5-dimensional set, rather than 
two 5-dimensional sets. For example, the average distance 
of a point to the 5-dimensional best fit subspace by the points 
in the same cluster is 2.7 x 10 3 , and the average distance to 
the 5-dimensional best fit subspace of the whole data set is 
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3.3 x 10 3 , whereas the average norm of a point in the data set 
is 1.1 x 10 4 . Thus one loses little in terms of relative fitting 
error by considering the set as spanned by a single subspace. 

However, most points are actually closer to the subspace 
spanned by the same face than to the subspace spanned by 
the other face, if only by a little, and a global method such 
as SSC is still able to find and discriminate between the two 
affine clusters. The problem of data having large variance in 
directions irrelevant to a classification task is not unusual. A 
standard method of dealing with this situation is to "whiten" 
the data; i.e. reduce the value of the large singular values. 
A very crude whitening is obtained by simply removing the 
first few principal components. If we exclude the first two 
principal components after reducing the dimension to 5K 
for LBF/SLBF algorithms, we see in Table |5]that the results 
are greatly improved and become competitv^. With more 
sophisticated whitening, the results can be further improved. 

3.4 Clustering results on MNIST data set 

Finally, we work on the MNIST data set (available at http://ya 
nn.lecun.com/exdb/mnist/). This data set consists of several 
thousand 28 x 28 images of the digits through 9. We work 
on some subsets of the data which contain 2 or 3 digits and 
choose 200 images for each digit at random. We apply PCA 
to reduce the dimension to D = 5 for GPCA and to both 
D = 10 and D = 50 for the rest of algorithms. The choice 
of both D = 10 and D = 50 provide richer testing opportu- 
nities, this is however unavailable for GPCA, which cannot 
handle D = 50 and often get stuck with D — 10. We pro- 
cess the data the same way as in Section 13.31 We run each 
experiment 500 times, using d = 3 and the correct number 
of clusters, and record the misclassification rates, the stan- 
dard deviation and running time in Tables 171191181 and [TOl 

From Table |7j and [HJ SLBF and SLBF-MS are the best 
algorithms among all the methods in terms of misclassifi- 
cation rates, although these misclassification rates are larger 
when K = 3. SCC, SCC-MS, SSC, LBF and LBF-MS are 
also good algorithms for this data set. ALC is almost as good 
as SLBF and SLBF-MS when K = 2, but it fails when 
K = 3. LBF, LBF-MS and if -flats are the fastest algorithms 
in MNIST data set. 



3.5 Automatic determination of the number of flats 

We explain how to use the elbow method to determine the 
number of affine clusters in any HLM algorithm, in par- 
ticular LBF and SLBF. Fixing an arbitrary HLM algorithm 
with the correct input of number of clusters K, let Fj , j — 

3 Removing principal components harms the performance of the 
other algorithms. 



1 , . . . , K be the K flats returned by that algorithm and Wk 
be the sum of squared distances of all data points to the flat, 
among these K flats, corresponding to their clusters. That is, 

K 

W K = J2 Yl dist 2 (x,Fi). (12) 

We note that Wk decreases as K increases. 

A classical method for determining the number of clus- 
ters is to find the "elbow", or the K past which adding more 
clusters does not significantly decrease the error. We search 
for the elbow by finding the maximum of the Second Order 
Difference (SOD) of the logarithm of Wk l43l: 

SOD(lnW^) = lnl^r-i + lnW K +i -2lnW K . (13) 

The optimal K is thus found by 

^sod = argmaxSOD(lnWK), ■ (14) 

where K = 2, ... , K max . 

In the following sections, we compare SOD (LBF), i.e., 
SOD applying LBF, SOD (LBF-MS), SOD (SLBF), SOD 
(SLBF-MS), SOD (SCC), SOD (SCC-MS) and SOD(SSC) 
with ALC E7J and part of GPCA [38] on a number of arti- 
ficial data sets and real data sets. These experiments run on 
a machine with Intel Core 2 Quad Q6600 at 2.4GHz and 8 
GB memory. 

5.5.7 Finding the number of clusters on artificial data 

We test SOD with LBF and SLBF on artificial data (where 
the number of clusters is not provided to the user) and com- 
pare them with some other methods (three variations of ALC, 
number of clusters by GPCA and SOD with SSC and SCC). 
The artificial data sets are generated by the Matlab code bor- 
rowed from the GPCA [ 38 1 package on http://perception.csl. 
uiuc.edu/gpca. For each subspace 100<i initial data points are 
uniformly sampled in a unit cube in this subspace centered 
around the origin and then corrupted with Gaussian noise in 
M. D of standard deviation 0.05. For the last four experiments, 
we restrict the angle between subspaces to be at least it/ 8 for 
separation. The dimension d is given and we let K niax = 10 
in SOD. 

In ALC (voting), we try 101 different values from 10~ 5 
to 10 3 for e (as in 113011 ) and choose the estimated K by ma- 
jority. In ALC (e from LBF), we choose the average noise 
in the neighborhood using the local best-fit heuristic as the 
distortion rate e. In ALC (oracle), we input the true noise 
level (e = 0.05) as the distortion rate. For GPCA, we use 
the original idea of (38, eqs. (26), (28)] to find the num- 
ber of clusters (see our implementation in the supplemen- 
tal webpage). We project the data onto a d + 1 -dimensional 
subspace by PCA and let the tolerance of rank detection be 
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Table 5 Mean percentage of misclassified points and mean running time on clustering the extended Yale face database B. 



K 


2 


3 


4 


5 


6 


7 


8 


9 


10 


LBF(without whitening) 


e% 
t(s) 


32.49 
0.24 


54.42 
0.48 


57.45 
0.82 


56.00 
1.26 


56.24 
1.93 


56.94 
2.97 


59.53 
4.18 


59.66 
5.81 


60.74 
8.05 


LBF-MS(without whitening) 


e% 
t(s) 


18.27 
0.12 


36.22 
0.21 


48.24 
0.36 


50.18 
0.57 


49.99 
0.89 


50.68 
1.41 


53.08 
2.06 


54.06 
2.98 


54.73 
4.13 




e% 
t(s) 


7.94 
0.24 


8.33 
0.50 


12.89 
0.87 


17.83 
1.38 


27.40 
2.09 


31.89 
3.28 


35.04 
4.58 


38.53 
6.38 


38.95 
8.57 


LBF-MS 


e% 
t(s) 


8.40 
0.12 


9.51 
0.22 


12.18 
0.37 


15.57 
0.58 


19.18 
0.89 


22.88 
1.41 


27.20 
2.07 


30.39 
2.94 


33.17 
4.02 


SLBF 


e% 
t(s) 


11.12 
4.17 


14.78 
12.72 


20.42 
25.70 


26.52 
44.89 


32.96 
72.99 


36.91 
111.58 


40.49 
165.47 


42.99 
226.56 


46.63 
310.30 


SLBF-MS 


e% 
t(s) 


9.12 
3.84 


12.48 
12.20 


18.61 

23.88 


25.27 
41.24 


30.50 
64.10 


33.97 
95.73 


36.22 
142.09 


38.66 
192.34 


41.44 
262.40 


e% 
t(s) 


3.46 

42.99 


6.08 

122.29 


14.59 
258.20 


29.59 
451.07 


67.06 
699.52 


69.04 
1090.96 


76.00 
1625.10 


73.94 
2384.69 


77.16 

3343.93 


ALC 
(e from LBF) 


e% 
t(s) 


10.43 
0.95 


15.23 
2.49 


32.20 
5.54 


42.15 
11.54 


58.10 
24.38 


62.54 
45.27 


70.84 
78.05 


81.14 
132.35 


84.25 
211.15 




e% 
t(s) 


5.39 
1.62 


11.82 
3.85 


29.39 
9.52 


41.96 
15.37 


49.56 
22.71 


54.51 
32.45 


55.49 
54.54 


57.24 
56.91 


58.94 
75.92 


SCC-MS 


e% 
t(s) 


4.51 
1.62 


15.05 
4.20 


36.00 
9.28 


51.68 
14.49 


59.66 
22.08 


64.15 
31.71 


68.71 
54.21 


71.18 
56.99 


74.01 
73.10 


SSC 


e% 
t(s) 


6.45 
28.36 


8.10 
46.45 


10.04 

67.11 


10.32 

92.75 


11.02 

128.46 


11.85 

182.65 


12.47 

259.66 


13.41 

340.12 


15.44 

612.21 


K-flats 


e% 
t(s) 


7.20 
0.16 


12.12 
0.37 


19.06 
0.76 


26.77 
1.29 


32.59 
2.14 


35.18 
3.25 


38.58 
5.18 


42.00 
6.91 


44.40 
9.60 



Table 6 The standard deviation(%) to the mean percentage of misclassified points on the extended Yale face database B. 



Real K 


2 


3 


4 


5 


6 




8 


9 


10 


LBF(without whitening) 


20.46 


14.73 


4.87 


3.89 


5.54 


4.99 


4.63 


3.95 


3.22 


LBF-MS(without whitening) 


18.23 


18.77 


13.40 


6.74 


4.52 


5.51 


5.31 


4.90 


4.14 


LBF 


5.27 


3.73 


7.97 


9.86 


11.21 


10.38 


8.27 


6.52 


6.20 


LBF-MS 


4.25 


3.08 


5.33 


6.24 


7.73 


8.02 


8.29 


8.05 


7.25 


SLBF 


4.76 


5.37 


5.08 


5.25 


5.48 


5.42 


4.57 


4.74 


3.79 


SLBF-MS 


4.77 


5.37 


5.84 


4.91 


3.75 


3.76 


2.87 


3.01 


3.22 


ALC(votingwithK) 


2.21 


6.93 


13.87 


14.89 


16.84 


24.71 


18.05 


21.62 


16.98 


ALC(e from LBF) 


13.14 


12.96 


14.91 


16.40 


15.22 


12.22 


10.89 


6.76 


6.10 




11.71 


14.65 


10.60 


6.68 


5.14 


4.67 


4.32 


5.03 


SCC-MS 


2.84 


13.66 


14.66 


10.41 


8.29 


6.72 


5.61 


5.93 


5.46 


SSC 


4.57 


3.76 


4.52 


3.82 


3.59 


2.87 


3.18 


3.45 


5.21 


K-flats 


4.67 


6.86 


8.53 


8.89 


7.29 


6.41 


6.67 


4.84 


5.43 



Table 7 Mean percentage of misclassified points and mean running time on clustering MNIST data set (D=5 for GPCA, D=10 for other algo- 
rithms). 



subsets 


[1 2] 


[13] 


[17] 


[4 7] 


[2 4 8] 


[3 6 8] 


[12 3] 


K 


2 


2 


2 


2 


3 


3 


3 


LBF 


e% 


8.0 


8.5 


12.9 


25.5 


28.8 


28.1 


20.2 


t(s) 


0.4 


0.4 


0.3 


0.4 


0.7 


0.7 


0.7 


LBF-MS 


e% 


9.7 


7.8 


8.8 


24.0 


40.2 


33.5 


21.5 


t(s) 


0.2 


0.2 


0.2 


0.2 


0.5 


0.4 


0.4 




e% 


0.5 


1.0 


2.0 


3.0 


3.8 


19.7 


17.3 




t(s) 


13.9 


13.7 


13.5 


14.5 


41.9 


41.0 


42.7 


SLBF-MS 


e% 


0.5 


1.0 


2.0 


3.0 


3.8 


19.7 


17.3 


t(s) 


12.8 


13.7 


13.0 


14.6 


38.6 


46.3 


39.0 


ALC 


e% 


0.2 


2.2 


3.5 


48.5 


4.2 


42.7 


45.3 


(voting with K ) 


t(s) 


830.5 


823.3 


813.3 


753.2 


1789.5 


1871.8 


1987.7 


ALC 


e% 


20.3 


32.0 


51.8 


27.5 


4.0 


30.3 


14.5 


(e from LBF) 


t(s) 


23.2 


22.5 


21.6 


23.0 


55.6 


54.7 


54.0 


sec 


e% 


7.0 


6.4 


11.4 


23.4 


22.8 


26.7 


39.2 


t(s) 


1.2 


1.5 


1.4 


1.3 


2.5 


2.7 


2.3 


SCC-MS 


e% 


6.3 


7.9 


10.5 


23.2 


23.3 


26.9 


32.8 


t(s) 


0.9 


0.8 


1.1 


1.0 


1.9 


1.9 


1.5 




e% 


22.3 


30.8 


32.5 


47.0 


48.2 


33.8 


31.0 




t(s) 


8.7 


9.2 


9.4 


10.8 


24.9 


24.5 


22.5 


K-flats 


e% 


11.1 


6.8 


6.3 


29.1 


43.9 


40.7 


29.2 


t(s) 


0.4 


0.4 


0.4 


0.4 


0.9 


0.8 


0.6 




e% 


4.5 


3.5 


9.0 


21.0 


19.5 


24.5 


49.3 
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Table 8 Mean percentage of misclassified points and mean running time on clustering MNIST data set (D=50). 



subsets 


[12] 


[13] 


[17] 


[4 7] 


[2 4 8] 


[3 6 8] 


[12 3] 


K 


2 


2 


2 


2 


3 


3 


3 


LBF 


e% 
t(s) 


20.5 
2.8 


13.1 
2.8 


18.2 
2.6 


30.2 
3.1 


26.3 
5.2 


24.1 
5.1 


19.2 

4.7 


LBF-MS 


e% 
t(s) 


12.5 
1.3 


16.9 
1.3 


10.7 
1.3 


19.1 
1.3 


23.5 
2.3 


27.3 
2.3 


24.3 
2.3 


SLBF 


e% 
t(s) 


8.3 
15.1 


4.3 
15.0 


2.3 

14.6 


13.8 
16.8 


4.3 
37.5 


17.5 

38.5 


21.7 
39.5 


SLBF-MS 


e% 
t(s) 


5.5 
11.8 


3.3 

12.3 


5.0 
12.3 


5.5 

12.5 


3.2 

34.3 


18.5 
36.9 


21.7 
34.4 


ALC 
(voting with K) 


e% 
t(s) 


47.0 
1469.2 


46.0 
1445.6 


48.8 
1489.2 


100.0 
679.0 


100.0 
1530.1 


100.0 
1528.5 


65.3 
3032.4 


ALC 
(e from LBF) 


e% 
t(s) 


50.5 
93.0 


50.8 
93.6 


50.5 
91.0 


99.8 
9.4 


99.8 
18.2 


99.8 
17.9 


67.0 
163.5 


sec 


e% 
t(s) 


5.8 
0.9 


4.9 
1.0 


5.3 
1.1 


17.1 
0.9 


23.0 
1.6 


29.7 
2.0 


33.6 
1.7 


SCC-MS 


e% 
t(s) 


5.1 
0.9 


5.4 
1.0 


5.1 

1.2 


26.2 
1.0 


28.6 
1.8 


41.7 
1.9 


33.0 
2.0 


GPCA 


e% 
t(s) 


N/A 
N/A 


N/A 
N/A 


N/A 
N/A 


N/A 
N/A 


N/A 
N/A 


N/A 
N/A 


N/A 
N/A 


K-flats 


e% 
t{s) 


10.9 
2.8 


14.9 
2.9 


13.5 
2.9 


30.4 
3.1 


45.3 
6.2 


41.6 
5.6 


26.9 
5.1 


SSC 


e% 
t(s) 


16.8 
411.8 


2.0 

402.7 


3.2 
395.1 


20.0 
396.0 


11.3 
760.9 


17.8 
763.1 


45.5 
777.0 



Table 9 The standard deviation to the mean percentage of misclassified points on clustering MNIST data set (D=5 for GPCA, D=10 for other 
algorithms). 



subsets 


[12] 


[13] 


[17] 


[4 7] 


[2 4 8] 


[3 6 8] 


[12 3] 


K 


2 


2 


2 


2 


3 


3 


3 


LBF 


3.5 


4.1 


10.0 


11.4 


11.6 


8.3 


9.5 


LBF-MS 


5.9 


3.8 


10.0 


10.0 


10.3 


7.2 


7.8 


SLBF 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


SLBF-MS 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


ALC(voting with K) 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


ALC(e from LBF) 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


sec 


2.3 


2.7 


4.6 


9.9 


9.4 


7.5 


11.9 


SCC-MS 


2.0 


3.7 


5.2 


10.2 


8.3 


8.5 


9.2 


GPCA 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


K-flats 


7.6 


8.5 


7.8 


5.7 


7.4 


7.5 


5.9 


SSC 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 



Table 10 The standard deviation of the mean percentage of misclassified points on clustering MNIST data set (D=50). 



subsets 


[12] 


[13] 


[17] 


[4 7] 


[2 4 8] 


[3 6 8] 


[12 3] 


K 


2 


2 


2 


2 


3 


3 


3 


LBF 


5.6 


8.0 


8.3 


10.6 


11.0 


6.0 


6.0 


LBF-MS 


8.7 


10.5 


11.4 


11.2 


12.3 


8.9 


9.1 


SLBF 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


SLBF-MS 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


ALC(voting with K) 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


ALC(e from LBF) 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


sec 


0.6 


1.0 


0.9 


10.3 


3.7 


4.3 


13.9 


SCC-MS 


0.4 


0.7 


0.9 


15.5 


5.4 


4.5 


5.8 


GPCA 


N/A 


N/A 


N/A 


N/A 


N/A 


N/A 


N/A 


K-flats 


7.2 


11.3 


11.1 


7.5 


7.3 


8.1 


7.7 


SSC 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 


0.0 



0.05 (chosen by trying different values and picking the one 
obtaining the lowest error). This algorithm is independent 
of other parts of the GPCA algorithm and is thus extremely 
fast and can perform in high ambient dimensions. We even 
tried other ideas of [28, eqs. (3.28), (3.29)] (for the same 
given dimension d), while applying them to several HLM 
algorithms (even though they were originally presented for 
GPCA). Nevertheless, they did not work well and we thus 
did not report them. Each experiment is repeated 100 times 
(except for SOD(SSC), which is repeated 10 times due to its 



low speed) and the error rates of finding the number of clus- 
ters K and the computation time (in seconds) are recorded 
in Table E] 

As in Table ALC (oracle) and ALC (e from LBF) 
work the best for low dimensions (d = 1,2,3), but in real 
problems this choice (the noise level) for e is usually un- 
known. The local best-fit flat heuristic provides a good esti- 
mation for the distortion rate and helps ALC reduce its run- 
ning time. ALC (voting) is not as good as SOD (LBF) for 
artificial data. All options of ALC suffer from the computa- 
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Table 11 The mean percentage of incorrectness (e%) for finding the number of clusters K and the computation time in seconds t(s) on artificial 
data. 





no minimum angle 


minimum angle — 7r/8 




i" e R b 


2 4 e R b 


3 J e R b 


1" e R J 


2 4 e r j 


3 J e R 4 




i b e R J 


2 4 e R J 


3 d e R 4 


io 2 e R lb 


SOD 


e% 


22 


2 





58 


32 


12 





2 


6 


5 





(LBF) 


t(s) 


10.43 


13.76 


14.83 


9.84 


13.08 


14.49 


34.16 


9.95 


13.22 


14.47 


34.04 


SOD 


e% 


13 


1 


3 


67 


33 


9 





3 


8 


6 





(LBF-MS) 


t(s) 


8.70 


11.90 


12.92 


8.37 


11.54 


12.84 


27.56 


8.42 


11.60 


12.84 


27.69 


SOD 


e% 


75 


10 


5 





90 


95 


55 





85 


90 


55 


(SLBF) 


t(s) 


1097.19 


2148.06 


2895.85 


1076.24 


1774.74 


2629.26 


16124.50 


1224.96 


2387.70 


2830.83 


16510.13 


SOD 


e% 


90 


95 


70 





90 


85 


85 





75 


80 


80 


(SLBF-MS) 


t(s) 


908.76 


2094.68 


3141.77 


927.25 


1740.03 


2695.59 


15754.05 


990.88 


2302.66 


3010.64 


16493.95 


ALC 


e%(K) 


24 


12 


11 


32 


30 


17 


100 


5 


9 


9 


100 


(voting) 


t(<0 


2094.75 


2700.07 


3530.26 


1207.54 


2346.69 


3628.24 


119584.04 


1184.08 


2354.19 


3956.05 


117353.17 


ALC 


e%(K) 


1 





1 


20 


20 


3 


58 





3 


1 


63 


(e from LBF) 


t(s) 


23.72 


43.50 


57.50 


19.76 


36.67 


53.25 


1516.02 


19.81 


36.60 


53.01 


1770.77 


ALC 


e%(K) 


1 








34 


31 


1 


16 





10 


1 


13 


(oracle) 


t(s) 


23.74 


43.44 


59.14 


20.49 


37.49 


53.59 


1370.92 


20.22 


37.41 


54.11 


1354.11 


GPCA 


e% 


88 


100 


100 


27 


100 


100 


100 


13 


100 


100 


100 


t(s) 


0.03 


0.09 


0.12 


0.06 


0.09 


0.12 


1.30 


0.04 


0.09 


0.12 


1.30 


SOD 


e%(K) 


35 


21 


1 


63 


39 


17 





9 


32 


11 


1 


(SCC) 


t(s) 


32.09 


61.26 


95.79 


25.83 


59.41 


76.13 


475.45 


26.74 


41.95 


61.53 


466.79 


SOD 


e%(K) 


71 


32 


2 


80 


50 


12 





46 


33 


3 





(SCC-MS) 


t(s) 


31.78 


67.77 


111.15 


22.29 


55.25 


74.07 


475.50 


24.53 


51.98 


75.03 


471.31 


SOD 


e%(K) 


10 


80 


70 


100 


70 


70 


100 


50 


80 


80 


100 


(SSC) 


t(s) 


39.88 


2634.80 


3039.55 


1708.37 


2447.01 


2925.27 


14918.10 


1452.43 


2101.84 


2641.68 


14227.32 



tion complexity, especially ALC (voting). SOD (LBF) and 
SOD (LBF-MS) get reasonable outputs and have obvious 
advantage of computing time. GPCA is very fast, but does 
not work well. 

3.5.2 Finding the number of clusters on the extended Yale 
face database B 

We use the extended Yale face database B as in Section [33] 
for testing the above algorithms for detecting the number 
of clusters. The ambient dimension is reduced to D = 5K 
by PCA for all of the methods and the intrinsic dimension 
of subspaces is set as d = 5 (see Section [3~3l . For SOD 
with different clustering algorithms, we let K max = 6, 8, 
8, 10 and 10 respectively for 2 to 6 clusters. For GPCA, we 
let tolerance be 0.05 which does not affect the performance 
in this experiment. Each experiment is repeated 500 times 
(except for SOD(SSC), which is repeated 10 times due to 
its low speed). Following Section [3731 we apply LBF, LBF- 
MS, SLBF and SLBF-MS with whitening. The error rates of 
finding the correct number of clusters and the computation 
time are recorded in Table [T2l 

We see from Table [T2lthat SOD only performs well with 
SSC with K smaller than 4. We note that this is due to 
the difficulty of the database. Indeed for a simpler database 
such as Yale Face database B [ 14] of uncropped faces, SOD 
(SLBF), SOD (SLBF-MS), ALC (e from LBF) and ALC 
(voting) have perfect detection for K < 10 (whitening is 
not applied then). 

3.5.3 Finding the number of clusters on MN 1ST data set 

We preprocess MNIST data set exactly the same way as we 
did in Section [3~4l The ambient dimension is reduced to both 



D = 10 and D = 50 by PCA for all of the methods in- 
cluding GPCA and 3 is given as the intrinsic subspace di- 
mension. For SOD with different clustering algorithms, we 
let -Kmax = 6, and 8 respectively for 2 and 3 clusters. For 
GPCA, we let the tolerance be 0.05 which does not affect 
the performance in this experiment. Each experiment is re- 
peated 500 times (except for SOD(SSC), which is repeated 
10 times due to its low speed). The error rates of finding 
the correct number of clusters and the computation time are 
recorded in Tables [T3l and [T4l 

For all the methods, determining the number K of clus- 
ters becomes very difficult when the real K is larger than 3. 
For real K < 3, we see from Table [T3l that when we project 
data to 10-dimensional space, ALC and GPCA fail in most 
cases, except for ALC (e from LBF) on digits [3 6 8]. SOD 
(SLBF), SOD (SLBF-MS) and SOD (SSC) outperform all 
others although they are not very efficient. 

3.6 Initializing if-flats with the local best-fit heuristic 

Here we demonstrate that our choice of neighborhoods in 
Algorithm Q] can be used to get a more robust initialization 
of i-T-flats. We work with geometric farthest insertion. For 
fixed neighborhood sizes, say of rn neighbors, this goes as 
follows: we pick a random point xo and then find the best-fit 
flat Fq for the m point neighborhood of xo. Then we find 
the point xi in our data farthest from Fq, find the best-fit flat 
Fi of the m neighborhood of xi, and then choose the point 
X2 farthest from Fq and F\ to continue. We stop when we 
have K flats; we use these as an initialization for A'-flats. 

We work on three data sets. Data set #1 consists of 1500 
points on three parallel 2-planes in K 3 . 500 points are drawn 
from the unit square in x, y plane, and then 500 more from 
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Table 12 The mean percentage of incorrectness (e%) for finding the correct number of clusters K and the computation time in seconds t(s) on 
the extended Yale face database B. 



Real K 


2 


3 


4 


5 


6 




e%(K) 


62 


61 


69 


78 


84 




t(s) 


1.30 


3.60 


5.69 


11.30 


15.84 


SOD 


e%(K) 


65 


75 


78 


81 


80 


(LBF-MS) 


t(s) 


0.67 


1.65 


2.49 


4.90 


6.83 


SOD 


e%(K) 


24 


60 


70 


86 


98 


(SLBF) 


t(s) 


115.97 


303.02 


338.35 


729.74 


811.40 


SOD 


e%(K) 


20 


60 


76 


92 


96 


(SLBF-MS) 


t(s) 


106.87 


272.88 


306.22 


649.50 


721.42 


ALC 


e%(K) 


100 


100 


100 


100 


100 




t(s) 


42.99 


122.29 


258.20 


451.07 


699.52 


' ALC 


e%(K) 


42 


36 


76 


86 


100 


(£ from LBF) 


t(s) 


0.95 


2.49 


5.54 


11.54 


24.38 


GPCA 


e%(K) 


100 


100 


100 


100 


100 


t(s) 


0.07 


0.13 


0.52 


0.71 


1.02 


SOD 


e%(K) 


100 


8 


12 


28 


38 


(SSC) 


t(s) 


172.50 


389.66 


567.39 


1015.99 


1336.57 



Table 13 The mean percentage of incorrectness (e%) for finding the correct number of clusters K and the computation time in seconds t(s) on 
MNIST data set (D=10). 



subsets 


[12] 


[1 3] 


[17] 


[4 7] 


[2 4 8] 


[3 6 8] 


[12 3] 


K 


2 


2 


2 


2 


3 


3 


3 




e% 


16.8 


3.8 


50.8 


50.4 


75.6 


70.0 


54.8 




t(s) 


3.5 


3.2 


3.0 


3.3 


7.7 


7.5 


7.3 


SOD 


e% 


9.6 


6.6 


33.4 


68.2 


80.4 


76.6 


44.2 


(LBF-MS) 


t(s) 


1.9 


1.9 


1.9 


1.8 


4.6 


4.6 


4.7 


SOD 


e% 


0.0 


0.0 


0.0 


0.0 


0.0 


100.0 


0.0 


(SLBF) 


t(s) 


173.9 


164.6 


160.3 


248.6 


710.1 


610.9 


548.5 


SOD 


e% 


0.0 


0.0 


0.0 


0.0 


0.0 


100.0 


0.0 


(SLBF-MS) 


t(s) 


164.6 


159.9 


150.1 


228.5 


676.6 


586.4 


556.2 


ALC 


e% 


100.0 


100.0 


100.0 


100.0 


100.0 


100.0 


100.0 


(voting) 


t(s) 


830.4 


823.2 


813.2 


753.2 


1789.5 


1871.8 


1987.5 


ALC 


e% 


100.0 


100.0 


100.0 


100.0 


100.0 


0.0 


100.0 


(£ from LBF) 


t(s) 


23.2 


22.5 


21.5 


22.9 


55.6 


54.7 


54.0 




e% 


100.0 


100.0 


100.0 


100.0 


100.0 


100.0 


100.0 




t(s) 


1.0 


1.0 


1.0 


1.1 


2.8 


2.8 


2.7 


SOD 


e%(K) 


3.8 


7.8 


66.4 


81.8 


64.4 


47.6 


82.6 


(SCC) 


t(s) 


14.5 


13.3 


14.7 


16.9 


37.5 


34.1 


35.0 


SOD 


e%(K) 


2.4 


16.4 


53.0 


77.4 


70.4 


49.6 


77.8 


(SCC-MS) 


t(s) 


13.7 


13.8 


13.5 


16.4 


38.0 


35.6 


29.4 


SOD 


e%(K) 


0.0 


0.0 


0.0 


100.0 


0.0 


100.0 


100.0 


(SSC) 


t(s) 


233.6 


210.3 


213.3 


218.4 


380.0 


386.4 


390.5 



Table 14 The mean percentage of incorrectness (e%) for finding the correct number of clusters K and the computation time in seconds t(s) on 
MNIST data set (D=50). 



subsets 


[12] 


[13] 


[1 V] 


[4 7] 


[2 4 8] 


[3 6 8] 


[12 3] 


K 


2 


2 


2 


2 


3 


3 


3 


SOD 


e% 


45.0 


35.0 


54.0 


79.0 


72.0 


67.0 


60.0 


(LBF) 


t(a) 


22.9 


23.5 


22.2 


24.9 


56.2 


54.6 


51.1 


SOD 


e% 


32.0 


22.0 


38.0 


66.0 


44.0 


82.0 


58.0 


(LBF-MS) 


t(s) 


12.2 


12.2 


12.2 


12.2 


29.3 


29.4 


29.4 


SOD 


e% 


0.0 


0.0 


0.0 


0.0 


0.0 


100.0 


100.0 


(SLBF) 


t(s) 


204.2 


198.1 


207.8 


295.8 


864.5 


766.5 


706.1 


SOD 


e% 


0.0 


0.0 


100.0 


0.0 


0.0 


100.0 


100.0 


(SLBF-MS) 


t(s) 


213.7 


201.7 


176.6 


259.9 


748.1 


640.0 


681.1 


ALC 


e% 


100.0 


100.0 


100.0 


100.0 


100.00 


100.0 


100.0 


(voting) 


t(s) 


1469.2 


1445.6 


1489.2 


679.0 


1530.1 


1528.5 


3032.4 


ALC 


e% 


100.0 


100.0 


100.0 


100.0 


100.0 


100.0 


100.0 


(e from LBF) 


t(s) 


93.0 


93.6 


91.0 


9.4 


18.2 


17.9 


163.5 


GPCA 


e% 


N/A 


N/A 


N/A 


N/A 


N/A 


N/A 


N/A 


t(s) 


N/A 


N/A 


N/A 


N/A 


N/A 


N/A 


N/A 


SOD 


e%{K) 


0.0 


4.0 


1.0 


50.5 


78.8 


30.3 


83.8 


(SCC) 


t(s) 


14.9 


10.6 


11.6 


11.6 


24.7 


26.2 


25.4 


SOD 


e%(K) 


0.0 


0.0 


0.0 


42.4 


89.9 


97.0 


93.9 


(SCC-MS) 


t(s) 


12.6 


13.0 


14.7 


13.9 


34.0 


36.8 


30.7 


SOD 


e%(K) 


0.0 


0.0 


0.0 


0.0 


0.0 


100.0 


100.0 


(SSC) 


t(s) 


426.4 


417.6 


409.3 


413.5 


823.8 


821.2 


836.8 
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Fig. 3 Color map of neighborhood size obtained by the local best- 
fit flat heuristic. The color value represents the number of neighbors 
chosen at that point. Note that the algorithm chooses smaller neigh- 
borhoods for points closer to the intersection of the planes. 



the x, y,z + .2 plane, and then 500 more from the x, y,z + A 
plane. This data set is designed to favor the use of small 
neighborhoods. The next data set is three random fiats with 
15% Gaussian noise and 5% outliers, generated using the 
Matlab code from GPCA, as in Section [3~T1 This data set is 
designed to favor large neighborhood choices. Finally, we 
work on a data set with 1500 points sampled from 3 planes 
in R 2 as in Figure [3] The error rates of if-flats with farthest 
insertion initialization with fixed neighborhoods of size 10, 
20, 160 are plotted against the error rates for farthest in- 
sertion with adapted neighborhoods (searched over the same 
range), averaged over 400 runs in Figure [4] Although our 
method did not always beat the best fixed neighborhood, it 
was quite close; and it always significantly better than the 
wrong fixed neighborhood size. Both methods did signifi- 
cantly better than a random initialization. 

In Figure [3] we plot the number of neighbors picked by 
our algorithm for each point of a realization of data set #3. 



ness conditions, we will be able to approach the problem of 
clustering data which lies on unions of smooth manifolds. 

We also believe that it will be possible to provide a the- 
oretical framework for performance guarantees with noise 
for LBF and SLBF Specifically, we hope to prove a quan- 
titative form of the following alternative: suppose the data 
lies on the union of d-dimensional affine sets, perhaps with 
additive noise and outliers. Then either 

1. Most points are roughly as close to an affine set they 
don't belong to as they are to their nearest 0(d) neigh- 
bors; 

2. A large fraction of the points have optimal neighbor- 
hoods contained in only one of the affine clusters, the 
principal components of these neighborhoods are good 
approximations to the clusters; and LBF and SLBF re- 
cover good approximations to the two affine clusters, or 

3. The data looks locally lower than e?-dimensional, even 
though each cluster is globally d-dimensional, and has 
high curvature; in this case, there are pure optimal neigh- 
borhoods, but the local estimation does not accurately 
represent the affine clusters. 



A Proof of Theorem Q] 

Assume without loss of generality that i* = 1. Note that when r < tq, 
-B(x* , r) n T(Li , w) = _B(x* , r) n supp(/i) and that Li is the mini- 
mizer of the RHS of {2j- Combining these observations with (2) and the 
fact that /?2 (x* , r) is invariant to scaling of r and w, we immediately 
obtain that for r < ro : 



/3|(xV) = 



St (Li, 



T )nfl(xM) dist ( x *' i i) 2d / i i 



max(r,it;) 



ins(x*,i) 



(15) 



In particular, /?2 (x* , r) is constant for all < r < w. 

To show that /32 (x* , r) is strictly decreasing whenever w < r < 
ro, we first note that for any ri and r2 satisfying w < n < r-z < ro: 



4 Conclusions and future work 

We presented a very simple geometric method for HLM 
based on selecting a set of local best-fit flats. The size of the 
local neighborhoods is determined automatically using the 
£2 P numbers; it is proven under certain geometric condi- 
tions that our method approximately finds the optimal local 
neighborhoods. We give extensive experimental evidence 
demonstrating the state of the art accuracy and speed of the 
algorithm on synthetic and real hybrid linear data. 

We believe that one promising next step is to adapt the 
method for multi-manifold clustering. As it is, our method, 
while quite good at unions of flats, cannot successfully han- 
dle unions of curved manifolds. We expect that by gluing to- 
gether groups of local best-fit flats related by some smooth- 



T(Li 



max(r2,ro) 



)n8(x*,l) C T(Li, 



max(ri , w) 



)nB(x',l). 



Moreover, any point in T(L 



5 max(ri ,w} 



\T(L 



1 ' max(?'2 ,w) 



(16) 



has 



). Com- 



a larger distance to L\ than any point in T(L%, max ( r2 w ) , 
bining these observations with U5\ . we conclude that /32(x*,ri) > 
,02 (x*, r2), i.e., /32(x*, r) is strictly decreasing on [tu,ro]. 

Next, we will prove {5j with a weaker requirement on r* . More 
precisely, we define r* = max(rj, r^), where 



r a + 2w 



L 3V2(C-1)K-' 
(i)+l)ro(ro+2w) 

Tq + 2W 



when d = 1 ; 



= , when d > 1 



and 



/ 2 — ' 

V (ro+2™) 2 7J 



(17) 



(18) 
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Fig. 4 Using our neighborhood choice to improve initialization ofk-flats: the first row is the visualization of three data sets, and the seconds row 
shows the corresponding figures such that the vertical axis is accuracy, and the horizontal axis is fixed neighborhood size in geometric farthest 
insertion for initialization of K flats. The red line is the result of using adapted neighborhoods. The data sets are #1,#2, and #3 as described in 
Section \3~6\ Random initialization leads to misclassification rates of .4 or greater for all three data sets. 



We further assume that w < ro and 
(r* 2 - rg)~ ■ (ro + w) r* 



(r* 2 - r 2 )^ + (r* 2 - w 2 ) a Vd+1 



< ro- 



(19) 



We will later show that ^4} implies H9) and we will also verify that 
ro < r* < 1.09 ro. 

Without loss of generality we assume that argmin i>1 
dist(x*, Li) = 2, and let xo be the center of mass of p,± + p,2 in 
B(0, r). Then for any r > ro 

m l n j ^ dlSt ^ X,L - > ^ d/t>min J ^ dlst ( x ' L ) \ d(^i+|U2) 

B(x»,r) B(x*,r) 



/ dist(x, L) 



L:x £L J V r 
B(x«,r) 



d(/il + /i 2 ) 



'dist(x, L) 



/ dist(x, L) 



> min / | - v ""' — - ) dui + min / I " v ~"' — - ) du 2 

L J \ r J L: X0 £Lj \ ' 

B(x* ,r) 



/ dist(x, Li 



B(x* ,r) 



B(x*,r) 

'■ , /" /dist(x.L) . , 
d/xi + _nnn i / I ] d^ 2 - 



L:x £L % 

B(x*,r) 



(20) 



We claim that when r = r*, the minimizer in the second expression 
in the RHS of <20b (denoted by Ltf) satisfies that dim(Lo n L2) = 
d— 1 and (Lo n L^) ^2- We denote the orthonormal vector passes 
through x* and xo by ui, one of the d orthonormal vectors that span 
L2 by U2, and one of the D — d — 1 orthonormal vectors that span 
(span(L2)) x by U2. We will prove that ui is the top eigenvector of 
Jb(x* r*) ( x ~ x °)( x — x o) T °Vt(x), an d u 2 is the second top eigen- 



vector, by proving 



(uj (x- x )) 2 dAt 2 (x) 



B(x*,r*)nT(t 2 ,m) 

>/ (u 2 (x- xo)) 2 d M 2(x) 

JB(x*,r*)nT(L 2 ,«j) 

> / (u3(x-x )) 2 d At2 (x). 

Jfl(x*,r*)nT(L,,») 



(21) 



We note that 

(B(x',«i)nL|) x (^B(x*,Vf* 2 -^ 2 )nLi) cT(Li,»)n 

B(x*,r*) C (B(x*,to) nLf-) x (B(x*,r*)nLi) . (22) 

Defining y as nearest point to x* on L2, then for r* > ro + 2w, we 
have that 

(B(y, w) n L£) x (s(y, \A'* 2 - (r +2«;) 2 ) n L 2 ) C T(L 2 , to)n 



B(x*,r*) C (%,»)n4)x ^B(y, ^/r* 2 - r 2 ) n L 2 J . (23) 
Moreover 

vol (B(xVi) n L 1 -) x (B(x*,r 2 ) nl) = C {d,D- d) r°- d r^. 

(24) 

Denote the center of mass of £?(x* , r* ) n T(L 2 , by xi , notice that 
||xo — x* || < ro + id, the center of mass of -B(x*, r*) n T(Lx, w) is 
x*, and x*, xq and xi satisfies 



xo ; 



vol(B(x*,r*) nT(Li,m))x*+vol(B(x*,r*) n T(L 2 , w)) xi 
vol(B(x*,r*) nr(Li,tu)) + vol(B(x*,r») nT(L 2 ,w)) 

(25) 
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Combining {22}, i23\ , 124b and i25) we have the estimation 

vol(B(x*,r*) nT(L 2 ,u))) (r + w) 



I xn— xi <- 



"vol(B(x*,r*) n T(Li,u>))+vol(B(x*,r*) n T(L 2 ,w)) 
(r* 2 — Tq) 2 



(r* 2 — w 2 ) 2 + (r* 2 — r 2 ) 2 



- ■ (r +w). 



(26) 



Therefore for any point xi in B(x*, r*) n T(L 2 , w), using 09b 
and )26t we have 



To estimate /3 2 (x* , r* ) and /fe (x*, ro), using integration the points 
in (B(x*.ri) nL 1 ) x (B(x*, 7-2) n L) has an average squared dis- 
tance 2 r l to ^- Besides, the points in (-B(y, ui) n ) X 
(£(y, ^/r* 2 - (r + 2u>) 2 ) n L 2 ) has an average squared distance 

at least (r* 2 — (ro + 2u>) 2 )/3 to the minimizer L in 420t . Combining 
these facts with <20t . (22}, (23}, and {24}, we have 

n=d w D-d+2 r *d +w D-d, r *2_ ( ro + 2 «,) 2 )^/3 
2 (x* ,r*)>— ~ d+2 - - — — 



|uf (xi -X )| > r - ||X -X*|| > rjj^j 



(wD-d r *d + (K - l)w D - d (r* 2 - r 2 )i) 



and 



and 



/ (uf (x-x )) 2 d /t2 (x) > -^-/X2(-B(x*,r*)nT(La,to)). 

Jfl(x* ,r*)nT(i a ,i«) a+ 1 

(27) 

Since any points in -B(x*, r*) nT(L2, w) has a distance to xo smaller 
than r*, we have 

/ (uf(x-x )) 2 +d / (uf(x-Xo)) 2 

J B(x.* ,r')nT(L 2 ,m) J B (x* ,r * ) nT(L 2 ,m) 

/ ||x- x || 2 d/ t2 (x) < r* 2 At 2 (B(x*,r*) nT(L 2 ,u.)). 

JB(x*,r«)nT(L 2 ,ii>) 

(28) 



/82OV0) < 



D-d w 



(32) 



(33) 



Combining 427t and l !28t . the first inequality in <21t is proved 

4 



By direct integration one obtains that the average of (uT^xi) 2 for 



xi in 

(B(y, w) n L2) x (B(y, ^r* 2 - (r + 2«,) 2 )) , 
' s dT^^* 2 ~ ( r ° + 2u>) 2 ), and the average of (u^xi) 2 forxi in 
T(i 2l t«)\((B(y,ffl)n4) x (B(y, ^* 2 -('o+2«)) 2 ))) 
is larger than that of the set 

(B(y, w) n L2) x (B(y, ^r* 2 - (r + 2«;) 2 )) . 
Applying these two facts, we obtain the estimate 
(uf(x* -x )) 2 d/i 2 

B(x*,r*)nT(L2,in) 

> -7-7-77 ('"* 2 - (r +2») 2 ) /l2 (B(x*,r*)nT(L 2 ,»)). (29) 



We also have 



/ (U3(x*-x 

J B(x* ,r* )nT(L 2 ,w) 



)) 2 dAt 2 < w 2 ti 2 (B(x*,r*)nT(L 2 ,w)). 



(30) 



Using the fact that r* > rj, we have 

r* 2 - (r + 2to) 2 > r* 2 - (r + 2w) 2 = (r + 2u>) 2 - 



ro — 4tqw — Aw 2 



: j - (™±M? „ . (roW + 4w 2 ) 



ro — 4ro«> — 4«> 2 



> 4ur 



(3D 



Combining j29t . <30b and )31t . the second inequality in \2\\ is also 
proved. 



To prove (5}, we only need to prove that the RHS of (32\ is larger 
than the RHS of ((33}, which has a following simplified form: 



D-d 



3(D 



+ 2 / (r + 2u>) 2 \ ' 
-d) ^ r* 2 J 



r r r r 

When a! = 1, (34} follows from 



(34) 



D + l \ ( _ (rp + 2w) 
3(D-1)J \ 

From r* > rj, we have 

\2\ 2 



2\ 3 



/ (r + 2 W ) 2 V 18(B-d) 2 K 2 
1, r *2 J - ( D _ d + 2 )2 r 2 (r + 2 W ) 2 ' ' ' 



and from r* > r^ we have 



2 ( 1 1 ) > 1 1 

(ro + 2ui) 2 r* 2 — r 2 r* 2 



(37) 



Then (35} follows from (36} and )37t . and therefore (5} is proved 
for the case d = 1 . 

For the case d > 2, the proof of (34} follows a similar strategy. 
Combing (37 1 and 



(r + 2u>) 2 \ ~ > 6(D - d)K u> 2 

J - (D-d + 2) (r + 2u>) 2 ! 



we obtain 



(38) 



1) 

~3(D - d) V~ r * 2 

and (34} follows from (39t . 

Now we will prove that (4} satisfies U9) . Notice that r* 2 — w 2 > 

(ro + 2ui) 2 — ui 2 > r 2 ,, it is sufficient to prove 



ro + ui 



1 + va+r 



<r . 



and since r* > r* > (ro + 2w)/y/l — c > (1 + 2c)ro/Vl — c an( l 
w < cro, where c = 0.02, we only need to prove 



1 + c 



1 + 2c 



I V(d+l)(l-c) 



< 1. 



(40) 



20 



Teng Zhang et al, 



It holds for c = 0.02 and d = 1. Since that when c is fixed, d = 

1 maximizes the LHS of {40]l, (40) holds for any d with c = 0.02. 
Therefore ((4) satisfies 4 19t and Theorem^is proved. 

At last we will show that r* < 1.09 ro. Indeed, r\ < ro(l + 

2 ■ 0.02)7^1-0.02 < 1.09 r , and r\ < r < 1.09r , 

V 1.04 2 ^ 

therefore r* = maxfrj.rj) < 1.09ro. 

Remark 1 The function /32{x,r) often does not have a local mini- 
mum at exactly ro. We demonstrate it for a particular case, but it 
is evident that this is rather typical. Assume that K = 2, d = 1, 
D = 2 and L\ _L L2, then for sufficiently small rj, {/3(x*,r) n 
T(Z/2,ui2)} C T(L\, /?2(x*, ro)). Following the same argument for 
the interval [wi* , ro], /?2(x*, r) is decreasing in the interval [ro,ro + 
„]. 
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